###### Abstract

We design and analyze TS-Cascade, a Thompson sampling algorithm for the cascading bandit problem. In TS-Cascade, Bayesian estimates of the click probabilities are constructed using a univariate Gaussian; this leads to a more efficient exploration procedure vis-à-vis existing UCB-based approaches. We also incorporate the empirical variance of each item's click probability into the Bayesian updates. These two novel features allow us to prove an expected regret bound of the form $\tilde{O}(\sqrt{KLT})$, where $L$ and $K$ are the number of ground items and the number of items in the chosen list respectively, and $T$ is the number of Thompson sampling update steps. This matches the state-of-the-art regret bounds for UCB-based algorithms. More importantly, it is the first theoretical guarantee on a Thompson sampling algorithm for any stochastic combinatorial bandit problem model with partial feedback. Empirical experiments demonstrate the superiority of TS-Cascade compared to existing UCB-based procedures in terms of both the expected cumulative regret and the running time.

# Thompson Sampling for Cascading Bandits

Wang Chi Cheung (IHPC, A*STAR), Vincent Y. F. Tan (National University of Singapore), Zixin Zhong (National University of Singapore)

## 1 Introduction

Online recommender systems seek to recommend a small list of items (such as movies or hotels) to users based on a larger ground set of items. The model we consider in this paper is the cascading bandits model (Kveton et al., 2015a). In the standard cascade model of Craswell et al. (2008), which is used widely in information retrieval and online advertising, the user, upon seeing this list of items, scans through it in a sequential manner. She looks at the first item and, if she is attracted by it, clicks on it. If not, she skips to the next item and clicks on it if she finds it attractive. This process stops when she clicks on one item in the list or when she comes to the end of the list, in which case she is not attracted by any of the items. The items that are in the ground set but not in the chosen list, as well as those in the list that come after the attractive one, are unobserved. Each item $i$, which has a certain click probability $w(i)$, attracts the user independently of the other items. Under this assumption, the optimal solution is the list of items that maximizes the probability that the user finds an attractive item. This is precisely the list of the $K$ most attractive items.

In the multi-armed bandits version of the cascade model (Kveton et al., 2015a), the click probabilities of the items are unknown to the learning agent and must be learned over time. If the user clicks on any item in the list, a reward of one is obtained by the learning agent. Otherwise, no reward is obtained. Based on the lists previously chosen and the rewards obtained thus far, the agent tries to learn the click probabilities (exploration) in order to adaptively and judiciously recommend other lists of items (exploitation) to maximize his overall reward over $T$ time steps.

Main Contributions. We design and analyze TS-Cascade, a Thompson sampling algorithm (Thompson, 1933) for the cascading bandits problem. Our design involves two novel features. First, the Bayesian estimates on the vector of latent click probabilities are constructed using a univariate Gaussian distribution. Consequently, in each time step, TS-Cascade conducts exploration in a suitably defined one-dimensional space. This leads to a more efficient exploration procedure than the existing Upper Confidence Bound (UCB) approaches, which conduct exploration in $L$-dimensional confidence hypercubes. Second, inspired by Audibert et al. (2009), we judiciously incorporate the empirical variance of each item's click probability in the Bayesian update. This allows efficient exploration of an item $i$ when its click probability $w(i)$ is close to $0$ or $1$.

We establish a problem-independent regret bound of $\tilde{O}(\sqrt{KLT})$ for our proposed algorithm TS-Cascade. Our regret bound matches the state-of-the-art regret bound for UCB algorithms on the cascading bandit model (Wang and Chen, 2017) up to a multiplicative logarithmic factor in the number of time steps $T$. Our regret bound is the first theoretical guarantee on a Thompson sampling algorithm for the cascading bandit model, or for any stochastic combinatorial bandit problem model with partial feedback (see the literature review). This addresses an open question on Thompson sampling raised by Zong et al. (2016) in the special case of no linear generalization. In our analysis, we successfully disentangle the statistical dependence between partial monitoring and Thompson sampling by analyzing a suitably weighted version of the Thompson samples (Lemma 4.3). In addition, we reconcile the statistical inconsistency in using Gaussian random variables to model click probabilities by considering a certain truncated version of the Thompson samples (Lemma 4.4).

Literature Review. Our work is closely related to existing works on the class of stochastic combinatorial bandit (SCB) problems and on Thompson sampling. In an SCB model, an arm corresponds to a subset of a ground set of items, each associated with a latent random variable. The corresponding reward depends on the constituent items' realized random variables. SCB models with semi-bandit feedback, where a learning agent observes all random variables of the items in a pulled arm, are extensively studied in existing works. Assuming semi-bandit feedback, Anantharam et al. (1987) study the case when the arms constitute a uniform matroid, Kveton et al. (2014) study the case of general matroids, Gai et al. (2010) study the case of permutations, and Gai et al. (2012), Chen et al. (2013), Combes et al. (2015), and Kveton et al. (2015b) investigate various general SCB problem settings. More general settings with contextual information (Li et al., 2010; Qin et al., 2014) and linear generalization (Wen et al., 2015) are also studied. All of the works listed above are based on the UCB idea.

Motivated by numerous applications in recommender systems and online advertisement placement, SCB models have been studied under the more challenging setting of partial feedback, where a learning agent only observes the random variables for a subset of the items in the pulled arm. A prime example of an SCB model with partial feedback is the cascading bandit model, which was first introduced by Kveton et al. (2015a). Subsequently, Kveton et al. (2015c), Katariya et al. (2016), Lagrée et al. (2016) and Zoghi et al. (2017) study the cascading bandit model in various general settings. Cascading bandits with contextual information (Li et al., 2016) and linear generalization (Zong et al., 2016) are also studied. Wang and Chen (2017) provide a general algorithmic framework for SCB models with partial feedback. All of the works listed above are also based on UCB.

On the one hand, UCB has been extensively applied for solving various SCB problems. On the other hand, Thompson sampling (Thompson, 1933; Chapelle and Li, 2011; Russo et al., 2018), an online algorithm based on Bayesian updates, has been shown to be empirically superior to UCB and $\varepsilon$-greedy algorithms in various bandit models. This empirical success has motivated a series of research works on the theoretical performance guarantees of Thompson sampling on multi-armed bandits (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Agrawal and Goyal, 2013a, 2017), linear bandits (Agrawal and Goyal, 2013b), generalized linear bandits (Abeille and Lazaric, 2017), etc. Thompson sampling has also been studied for SCB problems with semi-bandit feedback. Komiyama et al. (2015) study the case when the combinatorial arms constitute a uniform matroid; Wang and Chen (2018) investigate the case of general matroids; and Gopalan et al. (2014) and Hüyük and Tekin (2018) consider settings with general reward functions. In addition, SCB problems with semi-bandit feedback have also been studied in the Bayesian setting (Russo and Van Roy, 2014), where the latent model parameters are assumed to be drawn from a known prior distribution. Despite these existing works, an analysis of Thompson sampling for an SCB problem in the more challenging case of partial feedback is yet to be done. Our work fills this gap in the literature, and our analysis provides tools for handling the statistical dependence between Thompson sampling and partial feedback in the cascading bandit model.

## 2 Problem Setup

Let there be $L$ ground items, denoted as $[L] := \{1, \ldots, L\}$. Each item $i \in [L]$ is associated with a weight $w(i) \in [0,1]$, signifying the item's click probability. At each time step $t \in [T]$, the agent recommends a list of $K$ items $S_t := (i_1^t, \ldots, i_K^t)$ to the user, where $S_t$ belongs to the set of all $K$-permutations of $[L]$. The user examines the items from $i_1^t$ to $i_K^t$ by examining each item one at a time, until possibly all items are examined. For $k \in [K]$, the Bernoulli random variables $W_t(i_k^t) \sim \mathrm{Bern}(w(i_k^t))$ are i.i.d. across time, and $W_t(i_k^t) = 1$ iff the user clicks on $i_k^t$ at time $t$.

The instantaneous reward of the agent at time $t$ is
$$R_t := 1 - \prod_{k=1}^{K}\left(1 - W_t(i_k^t)\right).$$
In other words, the agent gets a reward of $1$ if $W_t(i_k^t) = 1$ for some $k \in [K]$, and a reward of $0$ if $W_t(i_k^t) = 0$ for all $k \in [K]$.

The feedback of the agent at time $t$ is defined as
$$k_t := \min\{1 \leq k \leq K : W_t(i_k^t) = 1\},$$
where we assume that the minimum over an empty set is $\infty$. If $k_t < \infty$, then the agent observes $W_t(i_k^t) = 0$ for $1 \leq k < k_t$, and also observes $W_t(i_{k_t}^t) = 1$, but does not observe $W_t(i_k^t)$ for $k > k_t$; otherwise, $k_t = \infty$, and the agent observes $W_t(i_k^t) = 0$ for $1 \leq k \leq K$.
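The click, reward and feedback mechanism just described can be sketched in code. The following is an illustrative simulator, not code from the paper; the function name, the example weight vector and the item indices are our own choices:

```python
import random

def cascade_round(w, S, rng=random):
    """Simulate one round of the cascade model.

    w: sequence mapping item index -> click probability w(i).
    S: ordered list of K recommended items.
    Returns (reward, feedback), where feedback maps each *observed*
    item to its realized Bernoulli value W_t(i); items after the
    first click remain unobserved, matching the partial feedback model.
    """
    feedback = {}
    for item in S:
        clicked = rng.random() < w[item]
        feedback[item] = int(clicked)
        if clicked:
            return 1, feedback   # user clicks on item k_t and stops scanning
    return 0, feedback           # k_t = infinity: all K items observed as 0

random.seed(0)
w = [0.8, 0.5, 0.3, 0.1]          # made-up click probabilities
reward, feedback = cascade_round(w, [2, 3, 0])
```

Note that `feedback` always covers a prefix of the recommended list, and it contains at most one click, in agreement with the definition of $k_t$ above.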

As the agent aims to maximize the sum of rewards over all $T$ steps, an expected cumulative regret is defined to evaluate the performance of an algorithm. First, the expected instantaneous reward is
$$r(S\,|\,w) := \mathbb{E}[R_t] = 1 - \prod_{i \in S}\left(1 - w(i)\right).$$
Note that the expected reward is permutation invariant, but the randomness in the set of observed items is not. Without loss of generality, we assume that $w(1) \geq w(2) \geq \cdots \geq w(L)$; then any permutation of $\{1, \ldots, K\}$ maximizes the mean reward. We let $S^* = (1, \ldots, K)$ be an optimal ordered $K$-subset for maximizing the expected reward; we refer to the items in $S^*$ as optimal items and to the others as suboptimal items. In $T$ steps, we aim to minimize the expected cumulative regret
$$\mathrm{Reg}(T) := \sum_{t=1}^{T}\left( r(S^*\,|\,w) - \mathbb{E}\left[ r(S_t\,|\,w) \right] \right),$$
while the vector of click probabilities $w \in [0,1]^L$ is not known to the agent, and $S_t$ is chosen online, i.e., dependent on the previous choices and previous rewards.
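As a concrete check of the expected-reward formula $r(S\,|\,w) = 1 - \prod_{i \in S}(1 - w(i))$ and of the claim that any permutation of the $K$ most attractive items maximizes it, consider this short sketch (the weight vector is a made-up example):

```python
from itertools import permutations

def expected_reward(w, S):
    # r(S|w) = 1 - prod_{i in S} (1 - w(i)): probability of at least one click
    p_no_click = 1.0
    for i in S:
        p_no_click *= 1.0 - w[i]
    return 1.0 - p_no_click

w = [0.8, 0.5, 0.3, 0.1]   # already sorted: w(1) >= w(2) >= ...
K = 2
best = max(permutations(range(len(w)), K),
           key=lambda S: expected_reward(w, S))
# The maximizer consists of the K most attractive items;
# r(S*|w) = 1 - 0.2 * 0.5 = 0.9 here, regardless of their order.
```

Any of the $K!$ orderings of the top-$K$ items attains the same value, which is why the expected reward, unlike the feedback, is permutation invariant.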

## 3 Algorithm

Table 1: Regret bounds for cascading bandit algorithms.

| Algorithm | Reference | Bounds | Problem Indep. |
|---|---|---|---|
| TS-Cascade | Present paper | $\tilde{O}(\sqrt{KLT})$ | ✓ |
| CUCB | Wang and Chen (2017) | $O(\sqrt{KLT\log T})$ | ✓ |
| CascadeUCB1 | Kveton et al. (2015a) | $O((L-K)\Delta^{-1}\log T + L)$ | ✗ |
| CascadeKL-UCB | Kveton et al. (2015a) | $O((L-K)\Delta^{-1}\log T)$ | ✗ |
| Cascading Bandits | Kveton et al. (2015a) | $\Omega((L-K)\Delta^{-1}\log T)$ (Lower Bd) | ✗ |

Our algorithm is presented in Algorithm 1. Intuitively, to minimize the expected cumulative regret, the agent aims to learn the true weight $w(i)$ of each item $i \in [L]$ by exploring the space (exploration) in order to identify $S^*$ (i.e., exploitation) after a hopefully small number of steps. In our algorithm, we approximate the true weight $w(i)$ of each item $i$ by a statistic $\theta_t(i)$ at each time step $t$. This statistic is known as the Thompson sample. To construct it, first, we sample a one-dimensional standard Gaussian $Z_t \sim \mathcal{N}(0,1)$, define the empirical variance $\hat{\nu}_t(i)$ of the previously observed arms, and calculate $\theta_t(i)$. Secondly, we select $S_t = (i_1^t, \ldots, i_K^t)$ such that the items in $S_t$ carry the $K$ largest Thompson samples, in decreasing order; this is reflected in Line 10 of Algorithm 1. Finally, we update the parameters for each observed item $i$ in a standard manner by applying Bayes rule on the mean of the Gaussian (with conjugate prior being another Gaussian) in Line 13.
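The steps above can be sketched as a compact simulation loop. This is only an illustrative reimplementation: in particular, the exact functional form of the exploration width (the `sigma` expression below) is our assumption for the sketch and need not coincide term-by-term with Algorithm 1:

```python
import math
import random

def ts_cascade(w, K, T, rng=random):
    """Illustrative sketch of TS-Cascade on a toy instance."""
    L = len(w)
    N = [0] * L            # number of observations of each item
    mu = [0.0] * L         # empirical mean click probability
    total_reward = 0
    for t in range(1, T + 1):
        z = rng.gauss(0.0, 1.0)          # one shared standard Gaussian Z_t
        theta = []
        for i in range(L):
            var = mu[i] * (1.0 - mu[i])  # empirical Bernoulli variance
            # assumed exploration width: variance-aware term vs. a floor
            sigma = max(math.sqrt(var * math.log(t + 1) / (N[i] + 1)),
                        math.log(t + 1) / (N[i] + 1))
            theta.append(mu[i] + z * sigma)
        # recommend the K items with the largest Thompson samples
        S = sorted(range(L), key=lambda i: theta[i], reverse=True)[:K]
        for i in S:                       # cascade feedback: stop at a click
            clicked = rng.random() < w[i]
            mu[i] = (N[i] * mu[i] + clicked) / (N[i] + 1)
            N[i] += 1
            if clicked:
                total_reward += 1
                break
    return total_reward

random.seed(1)
total = ts_cascade([0.9, 0.2, 0.2, 0.2], K=2, T=500)
```

Because all items share the single Gaussian $Z_t$, exploration is conducted in a one-dimensional space, which is the first novel feature discussed in the introduction.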

The algorithm results in the following theoretical guarantee. The proof is sketched in Section 4.

###### Theorem 3.1.

Consider the cascading bandit problem. Algorithm TS-Cascade, presented in Algorithm 1, incurs an expected regret at most $\tilde{O}(\sqrt{KLT})$,

where the $\tilde{O}$ notation hides a multiplicative polylogarithmic factor in $T$ and a constant factor that is independent of $K$, $L$ and $T$.

In practical applications, $T$ is large, and so the regret bound is essentially of order $\sqrt{KLT}$ up to logarithmic factors. We elaborate on the main features of the algorithm and the guarantee.

Firstly, this is Thompson sampling (Thompson, 1933) applied to the cascading bandits problem with partial feedback. The algorithm, which only utilizes partial information, is designed for real applications in which a user stops examining further items after observing an attractive one. Hence, the feedback from the user only reveals whether the examined items are attractive, but carries no information about the unexamined ones.

Secondly, even though it is more natural to use a Beta-Bernoulli update to maintain a Bayesian estimate on the click probability $w(i)$ (Russo et al., 2018), we use the Gaussian distribution instead of the Beta distribution in our algorithm. The use of the Gaussian is useful, since it allows us to readily generalize the algorithm and analyses to the contextual setting (Li et al., 2010). This handles heterogeneity in the online setting (Li et al., 2016), as well as the linear bandits setting (Zong et al., 2016) for handling a large $L$. We plan to study these extensions in a future work. However, the analysis of the Thompson sampling algorithm with a Gaussian Thompson sample also comes with some difficulties, as the sample $\theta_t(i)$ is not in $[0,1]$ with probability one. We perform a truncation of the Gaussian Thompson sample in the proof of Lemma 4.4 to show that this replacement of the Beta by the Gaussian does not incur any significant loss in terms of the regret.

Thirdly, Lines 5–7 indicate that the Thompson sample $\theta_t(i)$ is constructed to be a Gaussian random variable with mean $\hat{\mu}_t(i)$ and with variance given by the maximum of two terms involving the empirical variance $\hat{\nu}_t(i)$. Note that $\hat{\mu}_t(i)(1-\hat{\mu}_t(i))$ is the variance of a Bernoulli distribution with mean $\hat{\mu}_t(i)$. In Thompson sampling algorithms, the choice of the variance is of crucial importance. The reason why we choose the variance in this manner is to (i) make the Bayesian estimates behave like Bernoulli random variables and to (ii) ensure that it is tuned so that the regret bound has a $\sqrt{K}$ dependence (see Lemma 4.5) and does not depend on any pre-set parameters. We utilize a key result by Audibert et al. (2009) concerning the analysis of using the empirical variance in multi-armed bandit problems to achieve (i). In essence, in Lemma 4.3, the Thompson sample is shown to depend only on a single source of randomness, i.e., the Gaussian random variable $Z_t$ (Line 3 of Algorithm 1). This shaves off a factor of $\sqrt{K}$ vis-à-vis a more naïve analysis in which the variance is pre-set and the relevant probability in Lemma 4.3 depends on $K$ independent random variables.
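To see concretely why taking the maximum with a second term matters, note that when all of an item's observations coincide (which happens often when $w(i)$ is close to $0$ or $1$), the empirical variance is zero, so a purely variance-based exploration width would collapse to zero and freeze exploration of that item. A small numerical sketch, in which the functional form of the width is an assumption for illustration rather than a verbatim copy of Algorithm 1:

```python
import math

def width(nu_hat, N, t):
    # assumed exploration width: variance-aware term, with a floor term
    # that stays positive even when the empirical variance nu_hat is zero
    a = math.sqrt(nu_hat * math.log(t + 1) / (N + 1))
    b = math.log(t + 1) / (N + 1)
    return max(a, b)

# Degenerate item: 10 identical observations, empirical variance 0.
s_degenerate = width(0.0, N=10, t=100)
# Typical item: empirical variance at its maximum value 0.25.
s_typical = width(0.25, N=10, t=100)
```

Without the floor term `b`, `s_degenerate` would be exactly zero and the item would never receive another optimistic sample; with it, every item retains a positive exploration width that shrinks as $N_t(i)$ grows.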

Finally, in Table 1, we compare our regret bound for cascading bandits to those in the literature, which are all based on the UCB idea (Wang and Chen, 2017; Kveton et al., 2015a). Note that the last column indicates whether or not the bound is problem independent; being problem dependent means that the bound depends on the vector of click probabilities $w$. To present the results succinctly, for the problem-dependent bounds, we assume that the optimal items have the same click probability $w_1$ and the suboptimal items also have the same click probability $w_2 < w_1$; note though that TS-Cascade makes no such assumption. The gap $\Delta := w_1 - w_2$ is a measure of the difficulty of the problem. Table 1 implies that our upper bound grows like $\sqrt{T}$, just like the others. Our bound also matches the state-of-the-art UCB bound (up to log factors) by Wang and Chen (2017), whose algorithm, when suitably specialized to the cascading bandits setting, is the same as CascadeUCB1 in Kveton et al. (2015a). Our bound is worse than the problem-independent bound in Wang and Chen (2017) by a multiplicative logarithmic factor in $T$, but we are the first to analyze Thompson sampling for the cascading bandits problem.

## 4 Proof Sketch of Theorem 3.1

In this section, we provide a proof sketch of Theorem 3.1. We also provide the proofs of Lemmas 4.3 and 4.5. The remaining lemmas are proved in Appendix B.

During the iterations, we update the empirical mean $\hat{\mu}_t(i)$ of each item so that it approaches $w(i)$ eventually. To do so, we select the list $S_t$ according to the order of the Thompson samples $\theta_t(i)$ at each time step $t$. Hence, if $w$, $\hat{\mu}_t$ and $\theta_t$ are close enough, then we are likely to select the optimal list. This motivates us to define two "nice events": the event $\mathcal{E}_{w,t}$, on which the empirical means are close to the true weights,

where $\hat{\nu}_t(i)$ is defined in Line 5 of Algorithm 1, and the event $\mathcal{E}_{\theta,t}$, on which the Thompson samples are close to their means.

###### Lemma 4.1.

For each $t \in [T]$, the events $\mathcal{E}_{w,t}$ and $\mathcal{E}_{\theta,t}$ each hold with high probability.

Demonstrating that $\mathcal{E}_{w,t}$ has high probability requires the concentration inequality in Theorem A.1, which is a specialization of a result in Audibert et al. (2009) to Bernoulli random variables. Demonstrating that $\mathcal{E}_{\theta,t}$ has high probability requires the concentration property of Gaussian random variables, as established in Theorem A.2.

To start our analysis, define the quantity

(4.1)

which serves as a "statistical gap" term. Define the set of arms that are still insufficiently observed at time $t$. Recall that $S^* = (1, \ldots, K)$. As such, this set is non-empty.

Intuitions behind the set and its complement. Ideally, we expect the user to click on an item in $S_t$ for every time step $t$. Recall that the exploration terms are decreasing in $N_t(i)$, the number of time steps in $[t]$ in which we get to observe item $i$. Naively, arms in the set can be thought of as arms that "lack observations", while arms in its complement can be thought of as arms that are "observed enough" and are believed to be suboptimal. Note that $S^*$ is a prime example of an arm that is under-observed.

To further elaborate, the exploration term quantifies the "statistical gap" between the Thompson sample $\theta_t(i)$ and the latent mean $w(i)$. The gap shrinks with more observations of item $i$. To balance exploration and exploitation, for any suboptimal item $\ell$ and any optimal item $i$, we should have $\theta_t(\ell) \leq \theta_t(i)$. However, this is too much to hope for, and it seems that hoping for such a comparison to hold for at least one optimal item would be more viable. (See the forthcoming Lemma 4.2.)

Further notations. In addition to the set above, we define $\mathcal{F}_t$ as the collection of observations of the agent from the beginning until the end of time $t$ (after everything during time $t$ has occurred). Recall that $S_t$ is the arm pulled during time step $t$, and the feedback at time $t$ is the collection of observed items and their respective values. At the start of time step $t$, the agent has observed everything in $\mathcal{F}_{t-1}$, and determines the arm $S_t$ to pull accordingly (see Algorithm 1). Note that the event $\mathcal{E}_{w,t}$ is $\mathcal{F}_{t-1}$-measurable. For the convenience of discussion, we define $\mathcal{F}_0 := \emptyset$. The first statement in Lemma 4.1 can thus be rephrased as a high-probability statement conditional on $\mathcal{F}_{t-1}$.

The performance of Algorithm 1 is analyzed using the following four lemmas. To begin with, Lemma 4.2 quantifies a set of conditions on the empirical means and Thompson samples so that the pulled arm $S_t$ belongs to the collection of arms that lack observations and should be explored. We recall from Lemma 4.1 that the events $\mathcal{E}_{w,t}$ and $\mathcal{E}_{\theta,t}$ hold with high probability. Subsequently, we will crucially use our definition of the Thompson sample to argue that inequality (4.2) holds with non-vanishing probability when $t$ is sufficiently large.

###### Lemma 4.2.

Consider a time step $t$. Suppose that the events $\mathcal{E}_{w,t}$, $\mathcal{E}_{\theta,t}$ and inequality

(4.2)

hold. Then the pulled arm $S_t$ lies in the collection of under-observed arms defined above.

In the following, we condition on $\mathcal{F}_{t-1}$ and show that the Thompson sample is "typical" w.r.t. $\mathcal{F}_{t-1}$ in the sense of (4.2). Due to the conditioning on $\mathcal{F}_{t-1}$, the only source of randomness of the pulled arm $S_t$ is the Thompson sample. Thus, by analyzing a suitably weighted version of the Thompson samples in (4.2), we disentangle the statistical dependence between partial monitoring and Thompson sampling. Recall that $\theta_t(i)$ is normal with $\mathcal{F}_{t-1}$-measurable mean and variance (Lines 5–7 in Algorithm 1).

###### Lemma 4.3.

There exists an absolute constant $c > 0$ independent of $K$, $L$ and $T$ such that, for any time step $t$ and any historical observation $\mathcal{F}_{t-1}$, inequality (4.2) holds with probability at least $c$.

###### Proof.

We prove the lemma by setting the absolute constant $c$ to an explicit value arising from the Gaussian anti-concentration bound below.

For brevity, we introduce shorthand for the quantities appearing in (4.2). By the second part of Lemma 4.1, the event $\mathcal{E}_{\theta,t}$ fails with only small probability, so to complete this proof, it suffices to lower-bound the probability that (4.2) holds. For this purpose, consider

(4.3)

(4.4)

(4.5)

(4.6)

Step (4.3) is by the definition of $\theta_t$ in Line 7 of Algorithm 1. It is important to note that these samples share the same random seed $Z_t$. Next, step (4.4) is by the assumption of the lemma. Step (4.5) is an application of the anti-concentration inequality for a normal random variable in Theorem A.2. Step (4.6) is by the definition of the constant $c$.∎

Combining Lemmas 4.2 and 4.3, we conclude that there exists an absolute constant $c$ such that, for any time step $t$ and any historical observation $\mathcal{F}_{t-1}$,

(4.7)

Equipped with (4.7), we are able to provide an upper bound on the regret of our Thompson sampling algorithm at every sufficiently large time step.

###### Lemma 4.4.

Let $c$ be an absolute constant such that Lemma 4.3 holds true. Consider a sufficiently large time step $t$. Conditional on an arbitrary but fixed historical observation $\mathcal{F}_{t-1}$, we have a bound on the conditional expected regret incurred at time step $t$.

The proof of Lemma 4.4 relies crucially on truncating the original Thompson sample $\theta_t(i)$ to the interval $[0,1]$. Under this truncation operation, $S^*$ remains optimal under the truncated samples (as it was under the original ones), and the distance from the truncated Thompson sample to the ground truth $w(i)$ is not increased.

For any $t$ satisfying the condition of Lemma 4.4, we unravel the upper bound in Lemma 4.4 to establish the expected regret at time step $t$:

(4.8)

(4.9)

where (4.8) follows by assuming $t$ is sufficiently large.

We now bound the total regret by summing the per-time-step regret (4.9), and demonstrate the telescoping property of the summation in the following lemma.

###### Lemma 4.5.

For any realization of the historical trajectory, we have a deterministic bound on the sum of the per-step exploration terms.

Note that here we prove a worst-case bound, without needing the expectation operator.

###### Proof.

Recall that for each item $i$ and time step $t$, $N_t(i)$ is the number of rounds in $[t]$ in which we get to observe the outcome for item $i$. Since the summation involves $N_t(i)$, we first bound this term; the bound follows directly from the definitions.

Subsequently, we decompose the summation according to its definition. For a fixed but arbitrary item $i$, consider the sequence of time steps $1, \ldots, T$. Clearly, item $i$ contributes to the summation at time $t$ if and only if the decision maker observes the realization of item $i$ at $t$. Let $t_1 < t_2 < \cdots < t_{N_T(i)}$ be the time steps at which item $i$ is observed. We assert that $N_{t_n}(i) = n$ for each $n$. Indeed, prior to time step $t_n$, item $i$ is observed precisely in the time steps $t_1, \ldots, t_{n-1}$. Thus, we have

(4.10)

Now we complete the proof as follows:

(4.11)

(4.12)

(4.13)

where (4.11) follows from (4.10), (4.12) follows from the Cauchy-Schwarz inequality, and (4.13) is because the decision maker can observe at most $K$ items at each time step, hence $\sum_{i \in [L]} N_T(i) \leq KT$.∎
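The inequalities driving this proof, namely the per-item bound $\sum_{n=1}^{N} 1/\sqrt{n} \leq 2\sqrt{N}$, the Cauchy-Schwarz step, and the fact that $\sum_{i} N_T(i) \leq KT$, can be checked numerically on a simulated observation sequence. The simulation below is a hypothetical sketch with arbitrary parameters:

```python
import math
import random

random.seed(0)
L, K, T = 20, 4, 1000
N = [0] * L          # N_T(i): observation counts per item
per_round_sum = 0.0  # accumulates 1/sqrt(N_t(i)+1) over all observations
for _ in range(T):
    # at most K items are observed in any round, as in the cascade model
    observed = random.sample(range(L), random.randint(1, K))
    for i in observed:
        per_round_sum += 1.0 / math.sqrt(N[i] + 1)
        N[i] += 1

# Per item: sum_{n=1}^{N_i} 1/sqrt(n) <= 2*sqrt(N_i)   (integral bound)
# Across items (Cauchy-Schwarz): sum_i 2*sqrt(N_i) <= 2*sqrt(L * sum_i N_i)
# Together with sum_i N_i <= K*T, this yields the sqrt(KLT) scaling.
bound_cs = 2 * math.sqrt(L * sum(N))
bound_klt = 2 * math.sqrt(K * L * T)
```

The chain `per_round_sum <= bound_cs <= bound_klt` holds deterministically for any realization of the observation sequence, which mirrors the worst-case (expectation-free) nature of Lemma 4.5.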

Finally, we bound the total regret from above by considering a suitable cutoff time step $t_0$: we bound the regret for the time steps before $t_0$ by $1$ per step, and the regret for the time steps after $t_0$ by inequality (4.9), which holds for all $t > t_0$:

It is clear that the third term is of lower order, and by Lemma 4.5, the second term is of order $\sqrt{KLT}$ up to logarithmic factors. Altogether, Theorem 3.1 is proved.

## 5 Experiments

Table 2: Expected cumulative regrets (mean ± one standard deviation) and average running times of TS-Cascade, CascadeKL-UCB and CascadeUCB1 under various settings of $L$, $K$ and $\Delta$.

| $L$ | $K$ | $\Delta$ | TS-Cascade: Regret | Time | CascadeKL-UCB: Regret | Time | CascadeUCB1: Regret | Time |
|---|---|---|---|---|---|---|---|---|
| 16 | 2 | 0.15 | 377.07 ± 11.67 | 3.16 | 359.35 ± 26.42 | 54.3 | 1277.42 ± 25.88 | 2.82 |
| 16 | 4 | 0.15 | 294.55 ± 15.08 | 3.03 | 265.9 ± 20.36 | 54.48 | 990.51 ± 31.72 | 2.84 |
| 16 | 8 | 0.15 | 138.85 ± 9.81 | 3.51 | 148.36 ± 12.35 | 55.5 | 555.83 ± 14.41 | 3.17 |
| 32 | 2 | 0.15 | 738.19 ± 19.23 | 3.41 | 764.42 ± 48.57 | 105.4 | 2711.44 ± 58.41 | 2.98 |
| 32 | 4 | 0.15 | 612.36 ± 10.66 | 3.55 | 619.68 ± 34.56 | 105.56 | 2237.77 ± 43.7 | 3.02 |
| 32 | 8 | 0.15 | 381.8 ± 13.19 | 3.68 | 419.39 ± 19.59 | 105.64 | 1526.97 ± 24.48 | 3.14 |
| 32 | 2 | 0.075 | 1159 ± 63.43 | 3.49 | 1583.33 ± 104.04 | 106.62 | 4217.87 ± 129.08 | 3.95 |
| 32 | 4 | 0.075 | 1062.9 ± 80.06 | 3.55 | 1208.06 ± 59.25 | 106.08 | 3301.44 ± 85.43 | 3.84 |
| 32 | 8 | 0.075 | 631.45 ± 51.51 | 3.58 | 718.65 ± 32.27 | 106.51 | 1890.06 ± 47.8 | 3.97 |
| 64 | 2 | 0.075 | 1810.43 ± 126.74 | 4.74 | 3169.17 ± 156.98 | 207.31 | 7599.58 ± 199.99 | 4.24 |
| 64 | 4 | 0.075 | 1730.13 ± 128.09 | 4.88 | 2512.28 ± 106.85 | 208.08 | 6437.43 ± 239.96 | 5.04 |
| 64 | 8 | 0.075 | 1175.07 ± 46.91 | 4.7 | 1565.76 ± 72.98 | 208.34 | 3962.35 ± 87.61 | 4.77 |
| 128 | 2 | 0.075 | 2784.44 ± 185.08 | 5.36 | 6160.86 ± 300.48 | 414.45 | 11055.68 ± 156.27 | 5.17 |
| 128 | 4 | 0.075 | 2837.25 ± 239.41 | 4.76 | 5004.45 ± 188.68 | 412.55 | 11516.47 ± 227.48 | 4.7 |
| 128 | 8 | 0.075 | 2004.58 ± 122.26 | 4.87 | 3084.67 ± 105.78 | 413.6 | 7432.14 ± 129.24 | 4.61 |
| 256 | 2 | 0.075 | 4128.96 ± 400.88 | 8.35 | 10426.63 ± 249.33 | 816.52 | 12191.23 ± 39.69 | 7.22 |
| 256 | 4 | 0.075 | 4376.73 ± 373.99 | 7.49 | 9389.72 ± 251.5 | 818.07 | 15748.08 ± 131.08 | 7.56 |
| 256 | 8 | 0.075 | 3258.24 ± 238.91 | 7.24 | 6019.24 ± 145.95 | 820 | 12417.86 ± 160.53 | 7.83 |

In this section, we evaluate the performance of TS-Cascade using numerical simulations. To demonstrate the effectiveness of our algorithm, we compare the expected cumulative regret of TS-Cascade to CascadeKL-UCB and CascadeUCB1 in Kveton et al. (2015a). We reimplemented the latter two algorithms and checked that their performances are roughly the same as those in Table 1 of Kveton et al. (2015a).

We set the optimal items to have the same click probability $w_1$ and the suboptimal items to also have the same click probability $w_2$, so that the gap is $\Delta = w_1 - w_2$. We fix the horizon $T$ and vary $L$, $K$, and $\Delta$. We conduct independent simulations with each algorithm under each setting of $L$, $K$, and $\Delta$, and calculate the average and standard deviation of the cumulative regret, as well as the average running time of each experiment. Here we only present a subset of the results; more details are given in Appendix C.

In Table 2, we compare the performances of the algorithms under different settings. Since CascadeKL-UCB performs far better than CascadeUCB1, we mainly focus on the comparison between our method and CascadeKL-UCB. In most cases, the expected cumulative regret of our algorithm is significantly smaller than that of CascadeKL-UCB, especially when $L$ is large and $\Delta$ is small. Note that a larger $L$ means that the problem size is larger, while a smaller $\Delta$ implies that the differences between the optimal and suboptimal arms are less pronounced. Hence, when $L$ is large and $\Delta$ is small, the problem is "more difficult". However, the standard deviation of our algorithm's regret is larger than that of CascadeKL-UCB in some cases. A possible explanation is that Thompson sampling yields more randomness than UCB due to the additional randomness of the Thompson samples $\theta_t(i)$. In contrast, UCB-based algorithms do not have this source of randomness, as each upper confidence bound is designed deterministically. Furthermore, Table 2 suggests that our algorithm is much faster than CascadeKL-UCB and is just as fast as CascadeUCB1. The reason why CascadeKL-UCB is so slow is that an upper confidence bound has to be computed by solving an optimization problem for every item $i \in [L]$. In contrast, TS-Cascade in Algorithm 1 does not contain any computationally expensive steps.

In Figure 1, we plot the expected cumulative regret as a function of $T$ for TS-Cascade, CascadeKL-UCB and CascadeUCB1 under a representative setting of $L$, $K$ and $\Delta$. It is clear that our method outperforms the two UCB algorithms. For the case where the number of ground items $L$ is large, the UCB-based algorithms do not demonstrate $\sqrt{T}$ behavior even after many iterations. In contrast, the regret of TS-Cascade behaves as $\sqrt{T}$, which implies that the empirical performance corroborates the upper bound derived in Theorem 3.1. We have plotted the regret for other settings of $L$, $K$ and $\Delta$ in Appendix C, and the same conclusion can be drawn.

## 6 Summary and Future work

This work presents the first theoretical analysis of Thompson sampling for cascading bandits. The expected regret matches, up to logarithmic factors, the state-of-the-art bound based on UCB by Wang and Chen (2017) (whose algorithm, when specialized to the cascading bandit setting, is identical to CascadeUCB1 in Kveton et al. (2015a)). Empirical experiments show the clear superiority of TS-Cascade over CascadeKL-UCB and CascadeUCB1 in terms of both regret and running time.

The following are avenues for further investigation. From Table 1, we see that a problem-independent lower bound is still not available. It is envisioned that a judicious construction of an adversarial bandit example, together with the information-theoretic technique of Auer et al. (2002, Theorem 5.1), will lead to a problem-independent lower bound matching Theorem 3.1 here and Wang and Chen (2017) up to logarithmic factors. Next, we envision that a refinement of the proof techniques herein, especially the design of the Thompson samples as Gaussians, would be useful for generalizing to the contextual setting (Li et al., 2010; Qin et al., 2014; Li et al., 2016).

## References

- Abeille and Lazaric (2017) M. Abeille and A. Lazaric. Linear Thompson sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 176–184, 2017.
- Abramowitz and Stegun (1964) M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, ninth edition, 1964.
- Agrawal and Goyal (2012) S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23, pages 39.1–39.26, 2012.
- Agrawal and Goyal (2013a) S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31, pages 99–107, 2013a.
- Agrawal and Goyal (2013b) S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 127–135, 2013b.
- Agrawal and Goyal (2017) S. Agrawal and N. Goyal. Near-optimal regret bounds for Thompson sampling. J. ACM, 64(5):30:1–30:24, Sept. 2017.
- Anantharam et al. (1987) V. Anantharam, P. Varaiya, and J. Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays–Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, November 1987.
- Audibert et al. (2009) J.-Y. Audibert, R. Munos, and C. Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009. Algorithmic Learning Theory.
- Auer et al. (2002) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal of Computing, 32(1):48–77, 2002.
- Chapelle and Li (2011) O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011.
- Chen et al. (2013) W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 151–159, 2013.
- Combes et al. (2015) R. Combes, M. S. Talebi, A. Proutière, and M. Lelarge. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems 28, 2015.
- Craswell et al. (2008) N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 1st ACM International Conference on Web Search and Data Mining, pages 87–94, 2008.
- Gai et al. (2010) Y. Gai, B. Krishnamachari, and R. Jain. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In 2010 IEEE Symposium on New Frontiers in Dynamic Spectrum (DySPAN), pages 1–9, April 2010.
- Gai et al. (2012) Y. Gai, B. Krishnamachari, and R. Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, Oct 2012.
- Gopalan et al. (2014) A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 100–108, 2014.
- Hüyük and Tekin (2018) A. Hüyük and C. Tekin. Thompson sampling for combinatorial multi-armed bandit with probabilistically triggered arms. Manuscript, 2018. https://arxiv.org/abs/1809.02707.
- Katariya et al. (2016) S. Katariya, B. Kveton, C. Szepesvari, and Z. Wen. DCM bandits: Learning to rank with multiple clicks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1215–1224, 2016.
- Kaufmann et al. (2012) E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213, 2012.
- Komiyama et al. (2015) J. Komiyama, J. Honda, and H. Nakagawa. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of The 32nd International Conference on Machine Learning, volume 37, pages 1152–1161, 2015.
- Kveton et al. (2014) B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pages 420–429, 2014.
- Kveton et al. (2015a) B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767–776, 2015a.
- Kveton et al. (2015b) B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 535–543, 2015b.
- Kveton et al. (2015c) B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvári. Combinatorial cascading bandits. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 1450–1458, 2015c.
- Lagrée et al. (2016) P. Lagrée, C. Vernade, and O. Cappe. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems 29, pages 1597–1605. Curran Associates, Inc., 2016.
- Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
- Li et al. (2016) S. Li, B. Wang, S. Zhang, and W. Chen. Contextual combinatorial cascading bandits. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1245–1253, 2016.
- Qin et al. (2014) L. Qin, S. Chen, and X. Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In SDM, pages 461–469. SIAM, 2014.
- Russo and Van Roy (2014) D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
- Russo et al. (2018) D. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
- Thompson (1933) W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.
- Wang and Chen (2017) Q. Wang and W. Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, pages 1161–1171, 2017.
- Wang and Chen (2018) S. Wang and W. Chen. Thompson sampling for combinatorial semi-bandits. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 5114–5122, 2018.
- Wen et al. (2015) Z. Wen, B. Kveton, and A. Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 1113–1122, 2015.
- Zoghi et al. (2017) M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, C. Szepesvári, and Z. Wen. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 4199–4208, 2017.
- Zong et al. (2016) S. Zong, H. Ni, K. Sung, N. R. Ke, Z. Wen, and B. Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 835–844, 2016.

## Appendix A Useful theorems

Here are some basic facts from the literature that we will use:

###### Theorem A.1 ((Audibert et al., 2009), specialized to Bernoulli random variables).

Consider $n$ independently and identically distributed Bernoulli random variables $X_1, \ldots, X_n$, which have the common mean $\mu$. In addition, consider their sample mean $\hat{\mu}$ and their sample variance $\hat{V}$:

$$\hat{\mu} = \frac{1}{n}\sum_{j=1}^{n} X_j, \qquad \hat{V} = \frac{1}{n}\sum_{j=1}^{n} \left(X_j - \hat{\mu}\right)^2 = \hat{\mu}\left(1 - \hat{\mu}\right).$$

For any $\delta \in (0,1)$, the following inequality holds with probability at least $1 - 3\delta$:

$$\left|\hat{\mu} - \mu\right| \le \sqrt{\frac{2\hat{V}\log(1/\delta)}{n}} + \frac{3\log(1/\delta)}{n}.$$
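As a quick sanity check (not part of the original text), the bound in Theorem A.1 can be verified by Monte Carlo simulation; the parameter values below are illustrative choices, not values used in the paper.

```python
import math
import random

def empirical_bernstein_holds(n, mu, delta, rng):
    """Draw n Bernoulli(mu) samples and check whether the empirical
    Bernstein radius of Theorem A.1 covers the true mean."""
    xs = [1 if rng.random() < mu else 0 for _ in range(n)]
    mu_hat = sum(xs) / n
    v_hat = mu_hat * (1 - mu_hat)  # sample variance of Bernoulli draws
    radius = (math.sqrt(2 * v_hat * math.log(1 / delta) / n)
              + 3 * math.log(1 / delta) / n)
    return abs(mu_hat - mu) <= radius

rng = random.Random(0)
trials, n, mu, delta = 2000, 200, 0.3, 0.05
coverage = sum(empirical_bernstein_holds(n, mu, delta, rng)
               for _ in range(trials)) / trials
# Theorem A.1 guarantees coverage with probability at least 1 - 3*delta = 0.85;
# in practice the bound is conservative and the observed coverage is far higher.
print(coverage)
```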

###### Theorem A.2 ((Abramowitz and Stegun, 1964)).

Let $Z \sim \mathcal{N}(\mu, \sigma^2)$. For any $z > 0$, the following inequalities hold:

$$\frac{1}{\sqrt{2\pi}} \cdot \frac{z}{z^2 + 1}\, e^{-z^2/2} \;\le\; \Pr\left[Z - \mu > z\sigma\right] \;\le\; \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{z}\, e^{-z^2/2}.$$
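These two-sided Mills-ratio bounds can be checked numerically (a sanity check added here, not part of the original text), since the Gaussian upper tail is expressible through the complementary error function:

```python
import math

def gaussian_upper_tail(z):
    # Pr[Z - mu > z*sigma] for Z ~ N(mu, sigma^2), via erfc
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for z in [0.5, 1.0, 2.0, 4.0]:
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)  # standard normal density
    lower = (z / (z * z + 1)) * phi       # lower bound from Theorem A.2
    upper = phi / z                       # upper bound from Theorem A.2
    tail = gaussian_upper_tail(z)
    assert lower <= tail <= upper
```

The sandwich tightens as $z$ grows: both bounds approach $\phi(z)/z$, the classical asymptotic for the Gaussian tail.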

## Appendix B Proofs of main results

### B.1 Proof of Lemma 4

###### Lemma B.1.

###### Proof.

Bounding the probability of the concentration event: We first consider a fixed non-negative integer $N$ and a fixed item $i$. Let $Y_1, \ldots, Y_N$ be i.i.d. Bernoulli random variables with common mean $w(i)$. Denote by $\hat{\mu}_N = \frac{1}{N}\sum_{j=1}^{N} Y_j$ the sample mean and by $\hat{V}_N = \hat{\mu}_N(1 - \hat{\mu}_N)$ the empirical variance. By applying Theorem A.1 (with an appropriate choice of $\delta$), we have

$$\Pr\left[\,\left|\hat{\mu}_N - w(i)\right| > \sqrt{\frac{2\hat{V}_N \log(1/\delta)}{N}} + \frac{3\log(1/\delta)}{N}\,\right] \le 3\delta. \tag{B.1}$$

By an abuse of notation, let $\sqrt{2\hat{V}_N \log(1/\delta)/N} + 3\log(1/\delta)/N = \infty$ if $N = 0$. Inequality (B.1) then implies the following concentration bound for all non-negative $N$:

$$\Pr\left[\,\left|\hat{\mu}_N - w(i)\right| > \sqrt{\frac{2\hat{V}_N \log(1/\delta)}{N}} + \frac{3\log(1/\delta)}{N}\,\right] \le 3\delta \quad \text{for all } N \ge 0. \tag{B.2}$$

Subsequently, we can establish the concentration property of the empirical means by taking a union bound of (B.2) over all items and all sample counts: