Be Aware of Non-Stationarity: Nearly Optimal Algorithms for Piecewise-Stationary Cascading Bandits
Cascading bandit (CB) is a variant of both the multi-armed bandit (MAB) and the cascade model (CM), where a learning agent aims to maximize the total reward by recommending $K$ out of $L$ items to a user. We focus on a common real-world scenario where the user's preference can change in a piecewise-stationary manner. Two efficient algorithms, GLRT-CascadeUCB and GLRT-CascadeKL-UCB, are developed. The key idea behind the proposed algorithms is to incorporate an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT), within classical upper confidence bound (UCB) based algorithms. Gap-dependent regret upper bounds of the proposed algorithms are derived, and both match the lower bound up to a poly-logarithmic factor in the number of time steps $T$. We also present numerical experiments on both synthetic and real-world datasets to show that GLRT-CascadeUCB and GLRT-CascadeKL-UCB outperform state-of-the-art algorithms in the literature.
Online recommendation (Li et al., 2016) and web search (Dupret and Piwowarski, 2008; Zoghi et al., 2017) are important in the modern economy. Based on a user's browsing history, these systems strive to maximize satisfaction and minimize regret by presenting the user with a list of items (e.g., web pages and advertisements) that meet her/his preference. We focus on the popular cascading bandit (CB) model (Kveton et al., 2015), which is a variant of both the multi-armed bandit (MAB) (Auer et al., 2002a) and the cascade model (CM) (Craswell et al., 2008). In the CB model, the learning agent aims to identify the $K$ most attractive items out of the $L$ items contained in the ground set. At each time, the learning agent recommends a ranked list of $K$ items and receives the reward and feedback from the user. The goal of the agent is to maximize the total reward.
Existing works on CB (Kveton et al., 2015; Cheung et al., 2019) and MAB (Lai and Robbins, 1985; Auer et al., 2002a; Li et al., 2019) can be categorized according to whether a stationary or a non-stationary environment is studied. The stationary environment refers to the scenario where the reward distributions of arms (in MAB) or attraction distributions of items (in CB) do not evolve over time. On the other hand, a non-stationary environment is prevalent in real-world applications such as web search, online advertisement, and recommendation (Jagerman et al., 2019; Yu and Mannor, 2009; Pereira et al., 2018). If algorithms designed for stationarity are directly applied to a non-stationary environment, linear regret may occur (Li and de Rijke, 2019; Garivier and Moulines, 2011). Two types of non-stationary environment models have been proposed and studied in the literature. One is the adversarial environment (Auer et al., 2002b; Littlestone and Warmuth, 1994), whereas the other is the piecewise-stationary environment. The piecewise-stationary environment was introduced in prior works on MAB (Hartland et al., 2007; Kocsis and Szepesvári, 2006; Garivier and Moulines, 2011), where the user's preference remains stationary in certain time periods, named piecewise-stationary segments, but can shift abruptly at some unknown time steps, called change-points. In this paper, we focus on the piecewise-stationary environment since it models real-world applications better. For instance, in recommendation systems, a user's preference for an item is unlikely to be invariant (the stationary environment) or to change significantly at every time step (the adversarial environment).
To address the piecewise-stationary MAB, two types of approaches have been proposed in the literature: passively adaptive approaches (Garivier and Moulines, 2011; Besbes et al., 2014; Wei and Srivatsva, 2018) and actively adaptive approaches (Cao et al., 2019; Liu et al., 2018; Besson and Kaufmann, 2019; Auer et al., 2019). Passively adaptive approaches make decisions based on the most recent observations and are unaware of when a change-point occurs. For active adaptation, a change-point detection algorithm such as CUSUM (Page, 1954), the Page Hinkley Test (PHT) (Hinkley, 1971), the Generalized Likelihood Ratio Test (GLRT) (Willsky and Jones, 1976), or Sliding Window (SW) (Cao et al., 2019) is included. It has been demonstrated for MAB, via extensive numerical experiments, that actively adaptive approaches outperform passively adaptive approaches (Mellor and Shapiro, 2013; Cao et al., 2019; Liu et al., 2018). However, for CB, only passively adaptive approaches have been studied in the literature (Li and de Rijke, 2019). This motivates us to develop actively adaptive approaches for CB. Our main contributions are summarized as follows.
Unlike previous passively adaptive algorithms, such as CascadeDUCB and CascadeSWUCB (Li and de Rijke, 2019), we propose two actively adaptive algorithms, GLRT-CascadeUCB and GLRT-CascadeKL-UCB by incorporating an efficient change-point detection component, the GLRT, within upper confidence bound (UCB) (Auer et al., 2002a) and Kullback–Leibler UCB (KL-UCB) (Garivier and Cappé, 2011; Cappé et al., 2013) algorithms. The GLRT is almost parameter-free as compared to change-point detection methods used in previous non-stationary bandit literature (Liu et al., 2018; Cao et al., 2019).
We derive gap-dependent upper bounds on the regret of the proposed GLRT-CascadeUCB and GLRT-CascadeKL-UCB. When the number of piecewise-stationary segments $N$ is known, a regret of $\tilde{\mathcal{O}}(\sqrt{NLT})$ is established for both algorithms, where $L$ is the number of items and $T$ is the number of time steps. When $N$ is unknown, the regret is $\tilde{\mathcal{O}}(N\sqrt{LT})$ for both algorithms. Compared to the best existing passively adaptive algorithm, CascadeSWUCB (Li and de Rijke, 2019), the proposed algorithms improve the dependence on $T$ in the regret bound.
The efficiency of the proposed GLRT-CascadeUCB and GLRT-CascadeKL-UCB relative to other state-of-the-art algorithms is demonstrated on both synthetic and real-world datasets.
Compared to recent works on piecewise-stationary MAB (Besson and Kaufmann, 2019) and combinatorial MAB (CMAB) (Zhou et al., 2019) that adopt the GLRT as the change-point detector, the problem setting considered in this paper is different. In MAB, only one selected item, rather than a list of items, is allowed at each time. Notice that although CMAB (Combes et al., 2015; Cesa-Bianchi and Lugosi, 2012; Chen et al., 2016) also allows a list of items at each time, it assumes full feedback on all selected items under the semi-bandit setting. In CB, by contrast, the learning agent can only observe partial feedback, as will become clear later. Furthermore, we develop the analysis of both UCB-based and KL-UCB-based algorithms for CB, whereas only one of them (either the UCB-based or the KL-UCB-based algorithm) is analyzed in Besson and Kaufmann (2019) and Zhou et al. (2019).
The remainder of the paper is organized as follows. We describe the problem formulation in Section 2. The proposed algorithms, GLRT-CascadeUCB and GLRT-CascadeKL-UCB, are explained in detail in Section 3. We prove upper bounds on the regret of the proposed algorithms in Section 4 and present the numerical experiments in Section 5. Finally, we conclude the paper in Section 6.
2 Problem Formulation
2.1 Cascade Model
Before introducing the piecewise-stationary CB, we will first briefly review the cascade model in this subsection.
The CM (Craswell et al., 2008) is a prevalent model for explaining a user's behavior (e.g., click data) in web search and online advertising. In CM, the ground set that contains all $L$ items (e.g., all web pages or advertisements) is denoted as $\mathcal{E} = \{1, \ldots, L\}$. At each time step, the user is presented with a $K$-item ranked list $R = (a_1, \ldots, a_K) \in \Pi_K(\mathcal{E})$ by the learning agent, where $\Pi_K(\mathcal{E})$ is the set of all $K$-permutations of the ground set, with cardinality $L!/(L-K)!$. The user browses the list from the first item in order and clicks the first item that attracts her/him. If the user is attracted by item $a_k$, the user clicks on it and does not browse the remaining items (multi-click cases (Wang et al., 2015; Yue et al., 2010) are beyond the scope of this paper). Otherwise, if the user is not attracted by item $a_k$, the user browses item $a_{k+1}$, and so on until the last item in the list. During browsing, item $a_k$ attracts the user with probability $w(a_k)$ after the user browses it. We further pose a reasonable assumption on $w$ as follows.
The attraction of each item $\ell \in \mathcal{E}$ is independent of the attractions of the other items, where $w = [w(1), \ldots, w(L)]^\top$ is the associated attraction probability vector of the ground set $\mathcal{E}$.
After the user clicks on item $a_k$, the position $k$ of the clicked item is observed by the learning agent and used to learn the user's preference. Note that upon receiving this feedback, we can determine that items $a_1, \ldots, a_{k-1}$ were browsed but not attractive, item $a_k$ was browsed and attractive, and items $a_{k+1}, \ldots, a_K$ were unobserved by the user.
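The browsing process described above can be simulated in a few lines. The sketch below is illustrative only (the function name and the seeded generator are our own choices, not the paper's): the user scans the list in order and clicks the first attractive item, so the agent learns about every position up to the click and nothing afterwards.

```python
import random

def cascade_feedback(ranked_list, w, rng=random.Random(0)):
    """Simulate one round of the cascade model.

    ranked_list: the K recommended item ids, browsed in order.
    w: mapping from item id to attraction probability.
    Returns the 1-based position of the click, or None if no click occurs.
    """
    for position, item in enumerate(ranked_list, start=1):
        # The user browses items in order and clicks the first attractive one.
        if rng.random() < w[item]:
            return position  # items after this position remain unobserved
    return None  # the user browsed every item but clicked nothing
```

A returned position `k` thus encodes exactly the partial feedback in the text: positions before `k` were browsed and not attractive, position `k` was attractive, and later positions were never seen.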
2.2 Piecewise-Stationary Cascading Bandit Problem
A piecewise-stationary CB is characterized by a tuple $(\mathcal{E}, K, T, \{P_t\}_{t=1}^T)$, where $\mathcal{E} = \{1, \ldots, L\}$ is the ground set, $K$ is the number of recommended items, $T$ is the number of time steps, and $P_t$ is the attraction distribution at time $t$. The attraction of item $\ell$ at time $t$ is modeled as a Bernoulli random variable $Z_t(\ell)$, with $Z_t = [Z_t(1), \ldots, Z_t(L)]^\top \sim P_t$ containing all the attractions of the ground set. In our notational convention, $Z_t(\ell) = 1$ indicates that item $\ell$ is attractive to the user, and the pmf of $Z_t(\ell)$ is $\mathrm{Bern}(w_t(\ell))$. In a piecewise-stationary CB, $P_t$ changes across $t$ in a piecewise-stationary manner. Clearly, the $P_t$ are parameterized by the attraction probability vector $w_t = [w_t(1), \ldots, w_t(L)]^\top$. In addition, we have $w_t(\ell) = \mathbb{E}[Z_t(\ell)]$ for all $\ell \in \mathcal{E}$.
To formally define the piecewise-stationary environment, the number of piecewise-stationary segments is defined as
$$N = 1 + \sum_{t=1}^{T-1} \mathbb{1}\{P_t \neq P_{t+1}\}, \qquad (1)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function. Note that when a change-point occurs, at least one item changes its attraction distribution; hence, asynchronous attraction distribution changes are allowed. By the definition in (1), the number of change-points is $N-1$, and the change-points are denoted as $\nu_1 < \nu_2 < \cdots < \nu_{N-1}$. Specifically, $\nu_0 = 0$ and $\nu_N = T$ are defined for consistency. For each piecewise-stationary segment $i \in \{1, \ldots, N\}$, $P^{(i)}$ and $w^{(i)}(\ell)$ are adopted to denote the attraction distribution and the expected attraction of item $\ell$ on the $i$th piecewise-stationary segment, respectively, where $w^{(i)} = [w^{(i)}(1), \ldots, w^{(i)}(L)]^\top$ is the vector that contains the expected attractions of all items in the $i$th segment.
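As a concrete illustration of definition (1), the helpers below count segments and locate change-points directly from the sequence of attraction probability vectors; a change-point occurs whenever at least one item's probability changes. The function names are our own.

```python
def num_segments(w_rows):
    """Number of piecewise-stationary segments, N = 1 + sum_t 1{w_t != w_{t+1}}.

    w_rows: length-T list of attraction probability vectors w_t (one per step).
    """
    return 1 + sum(1 for t in range(len(w_rows) - 1) if w_rows[t] != w_rows[t + 1])

def change_points(w_rows):
    """1-based steps t such that w_t != w_{t+1} (the change-points nu_i)."""
    return [t + 1 for t in range(len(w_rows) - 1) if w_rows[t] != w_rows[t + 1]]
```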
Recommendation proceeds as follows. At time $t$, the agent recommends a list of $K$ items $R_t = (a_1^t, \ldots, a_K^t)$, where the list is decided based on the feedback of the user up to time $t-1$. Here, the user's feedback at time $t$ is the position of the clicked item,
$$k_t = \min\{1 \le k \le K : Z_t(a_k^t) = 1\},$$
with the convention that $k_t = \infty$ if no item is clicked.
The reward at time $t$ can be written as
$$r(R_t, Z_t) = 1 - \prod_{k=1}^{K} \big(1 - Z_t(a_k^t)\big), \qquad (2)$$
that is, the agent receives reward $1$ if and only if at least one recommended item is clicked.
The agent's goal is to maximize the cumulative reward across the $T$ time steps. Equivalently, the agent's policy is evaluated by its expected cumulative regret
$$\mathcal{R}(T) = \sum_{t=1}^{T} \mathbb{E}\big[r(R_t^*, Z_t) - r(R_t, Z_t)\big],$$
where $R_t^*$ is the optimal list that maximizes the expected reward at time $t$, and the expectation is taken with respect to the $Z_t$'s and the selection of the $R_t$'s. Under this setting, the optimal list is the one that maximizes the probability that at least one item in the recommended list is attractive, which is equivalent to the list of the $K$ most attractive items at time $t$. Since the reward defined in (2) is invariant to permutations of $R_t$, there are $K!$ optimal lists at each time $t$. Note that $R_t^*$ remains the same up to a permutation during a piecewise-stationary segment unless a change-point occurs.
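The reward in (2), the optimal top-$K$ list, and the resulting one-step regret can be computed directly. The sketch below makes these quantities concrete; the function names are our own.

```python
def expected_reward(ranked_list, w):
    """P(at least one recommended item attracts the user): 1 - prod_k (1 - w(a_k))."""
    p_no_click = 1.0
    for item in ranked_list:
        p_no_click *= 1.0 - w[item]
    return 1.0 - p_no_click

def top_k(w, K):
    """Indices of the K most attractive items (an optimal list, up to permutation)."""
    return sorted(range(len(w)), key=lambda i: w[i], reverse=True)[:K]

def one_step_regret(ranked_list, w, K):
    """Expected-reward gap between an optimal list and the recommended one."""
    return expected_reward(top_k(w, K), w) - expected_reward(ranked_list, w)
```

Note that `expected_reward` is invariant under permutations of `ranked_list`, matching the remark that any permutation of the top-$K$ items is optimal.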
2.3 Generalized Likelihood Ratio Test for Bernoulli Distribution
Sequential change-point detection is of fundamental importance in statistical sequential analysis; however, most existing algorithms place additional assumptions on both the pre-change and post-change distributions (Hadjiliadis and Moustakides, 2006; Siegmund, 2013; Draglia et al., 1999; Siegmund and Venkatraman, 1995) or even require both the pre-change and post-change distributions to be known (Lorden et al., 1971; Moustakides et al., 1986). These approaches are not applicable to CB, since the distributions are unknown to the agent and must be learned. In general, with pre-change and post-change distributions unknown, developing algorithms with provable guarantees is challenging. Several approaches, however, have recently appeared in the literature (Liu et al., 2018; Cao et al., 2019; Besson and Kaufmann, 2019). Here we adopt the GLRT (Besson and Kaufmann, 2019) (see Algorithm 1). Compared to other existing change-point detection methods with provable guarantees, the advantages of the GLRT are twofold: 1) fewer tuning parameters: the only required parameter for the GLRT is the confidence level $\delta$ of the change-point detection, while CUSUM (Liu et al., 2018) and SW (Cao et al., 2019) have three and two parameters to be manually tuned, respectively; 2) less required prior knowledge: whereas both CUSUM and SW require the smallest magnitude among the change-points as prior knowledge, the GLRT does not.
Next, we describe the GLRT. Suppose we observe a sequence of Bernoulli random variables $X_1, \ldots, X_n$ and aim to determine, as soon as possible, whether a change-point exists. Under the Bernoulli distribution, this problem can be formulated as a parametric sequential test of the following two hypotheses:
$$H_0 : \exists\, \mu_0 : X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\mu_0),$$
$$H_1 : \exists\, \mu_0 \neq \mu_1,\ \exists\, s : X_1, \ldots, X_s \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\mu_0) \ \text{and} \ X_{s+1}, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\mu_1),$$
where $\mathrm{Bern}(\mu)$ is the Bernoulli distribution with mean $\mu$. The Bernoulli GLR statistic is defined as
$$\mathrm{GLR}(n) = \sup_{1 \le s < n} \Big[ s\, \mathrm{kl}\big(\hat{\mu}_{1:s}, \hat{\mu}_{1:n}\big) + (n-s)\, \mathrm{kl}\big(\hat{\mu}_{s+1:n}, \hat{\mu}_{1:n}\big) \Big],$$
where $\hat{\mu}_{s:s'}$ is the empirical mean of the observations $X_s, \ldots, X_{s'}$, and $\mathrm{kl}(x, y)$ is the Kullback–Leibler (KL) divergence between two Bernoulli distributions,
$$\mathrm{kl}(x, y) = x \log\frac{x}{y} + (1-x)\log\frac{1-x}{1-y}.$$
The detection time of the Bernoulli GLRT change-point detector for a length-$n$ sequence with threshold $\beta(n, \delta)$ is
$$\tau_\delta = \inf\{ n : \mathrm{GLR}(n) \ge \beta(n, \delta) \},$$
where $\beta(n, \delta)$ has the same definition as in (13) of Kaufmann and Koolen (2018). To better understand the performance of the GLRT, it is instructive to consider an example.
Example 1 (Efficiency of GLRT).
Consider a sequence of Bernoulli random variables in which the first part of the sequence is generated from Bern(0.2) and the remaining part from Bern(0.8), as shown in Figure 1. Averaging over 100 Monte Carlo trials, the GLRT detects the change-point with only a small delay after it occurs.
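A setting like Example 1 can be reproduced with a direct implementation of the GLR statistic. The sketch below is a simplification: the threshold is a stand-in of the form $\log(n^{3/2}/\delta)$ rather than the sharper Kaufmann–Koolen threshold the paper uses, and the $O(n^2)$ scan over split points is unoptimized.

```python
import math

def bern_kl(x, y, eps=1e-12):
    """KL divergence kl(x, y) between Bernoulli(x) and Bernoulli(y)."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def glr_change_detected(xs, delta):
    """Bernoulli GLRT: return the first n at which a change is declared, else None."""
    prefix = [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
    for n in range(2, len(xs) + 1):
        mu_all = prefix[n] / n
        beta = math.log(n ** 1.5 / delta)  # simplified stand-in threshold
        for s in range(1, n):  # candidate change position
            mu1 = prefix[s] / s
            mu2 = (prefix[n] - prefix[s]) / (n - s)
            stat = s * bern_kl(mu1, mu_all) + (n - s) * bern_kl(mu2, mu_all)
            if stat > beta:
                return n  # change declared after observing n samples
    return None
```

On a sequence that switches from low to high means, the alarm fires a handful of samples after the switch; on an i.i.d. sequence the statistic stays near zero and no alarm is raised.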
3 Algorithms

The proposed algorithms, GLRT-CascadeUCB and GLRT-CascadeKL-UCB, are presented in Algorithm 2 and are motivated by Kveton et al. (2015) and Besson and Kaufmann (2019). Here, we denote the last detection time as $\tau$. The number of observations of the $\ell$th item after $\tau$ and the corresponding sample mean are denoted as $n_\ell$ and $\hat{w}(\ell)$, respectively. The proposed algorithms comprise three phases.
Phase 1: Forced uniform exploration to ensure that sufficient samples are gathered for all items to perform the Bernoulli GLRT detection (Algorithm 1).
Phase 2: UCB-based exploration (UCB or KL-UCB) to learn the optimal list on each piecewise-stationary segment.
Phase 3: Bernoulli GLRT change-point detection (Algorithm 1) to monitor if global restart should be triggered.
The proposed algorithms only require the number of time steps $T$, the ground set $\mathcal{E}$, the number of items $K$ in the recommended list, the uniform exploration probability $\alpha$, and the confidence level $\delta$ as inputs. The choices of $\alpha$ and $\delta$ will be discussed in Section 4, but here we emphasize that $\delta$ is the only parameter needed by the GLRT, whereas $\alpha$ relates to uniform exploration in bandit problems and also appears in other algorithms (Liu et al., 2018; Cao et al., 2019).
We now discuss the proposed algorithms in detail. The algorithm determines whether to perform uniform exploration or UCB-based exploration depending on whether the condition in line 3 of Algorithm 2 is satisfied, which ensures that the fraction of time steps spent in the uniform exploration phase is about $\alpha$. If uniform exploration is triggered, the first item in the recommended list is the item selected for forced exploration, and the remaining $K-1$ items in the list are chosen uniformly at random (line 4), which ensures that the explored item will be observed by the user. If UCB-based exploration is adopted at time $t$, the algorithms choose the $K$ items (line 6) with the largest UCB indices,
By recommending the list $R_t$ and observing the user's feedback (line 8), we update the statistics (line 10) and perform the Bernoulli GLRT detection (line 11). If the Bernoulli GLRT declares a change, we set $n_\ell = 0$ and $\hat{w}(\ell) = 0$ for all $\ell \in \mathcal{E}$, and update the last detection time $\tau$ (line 12). Finally, the UCB indices of the items are computed as follows (line 17),
$$\mathrm{UCB}_t(\ell) = \hat{w}(\ell) + \sqrt{\frac{3\log(t-\tau)}{2 n_\ell}}, \qquad (7)$$
$$\mathrm{UCB}^{\mathrm{KL}}_t(\ell) = \max\Big\{ q \in [\hat{w}(\ell), 1] : n_\ell\, \mathrm{kl}\big(\hat{w}(\ell), q\big) \le \log(t-\tau) + 3\log\log(t-\tau) \Big\}, \qquad (8)$$
where $t - \tau$ is the number of time steps since the last detection. Notice that (7) is the UCB index of GLRT-CascadeUCB, and (8) is the UCB index of GLRT-CascadeKL-UCB. For the intuition behind these indices, we refer the reader to the proof of Theorem 1 in Auer et al. (2002a) and the proof of Theorem 2 in Cappé et al. (2013).
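Putting the three phases together, the following condensed sketch mirrors the UCB variant of Algorithm 2. It is a simplified reading, not the paper's exact pseudocode: the forced-exploration schedule, the stand-in GLRT threshold $\log(n^{3/2}/\delta)$, and all function names are our own choices; only the UCB index follows the form in (7).

```python
import math
import random

def bern_kl(x, y, eps=1e-12):
    """KL divergence between Bernoulli(x) and Bernoulli(y)."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def glr_alarm(obs, delta):
    """Bernoulli GLRT on one item's observations since the last restart."""
    n = len(obs)
    if n < 2:
        return False
    total, head = sum(obs), 0.0
    beta = math.log(n ** 1.5 / delta)  # simplified stand-in threshold
    for s in range(1, n):
        head += obs[s - 1]
        mu1, mu2, mu = head / s, (total - head) / (n - s), total / n
        if s * bern_kl(mu1, mu) + (n - s) * bern_kl(mu2, mu) > beta:
            return True
    return False

def glrt_cascade_ucb(T, K, w_rows, alpha, delta, seed=0):
    """Run the three phases for T steps; return the total number of clicks."""
    rng = random.Random(seed)
    L = len(w_rows[0])
    obs = [[] for _ in range(L)]   # per-item feedback since the last restart
    tau, clicks = 0, 0
    period = math.ceil(L / alpha)  # one forced pass over all L items per period
    for t in range(1, T + 1):
        slot = (t - tau) % period
        if slot < L:  # Phase 1: forced uniform exploration of item `slot`
            rest = [i for i in range(L) if i != slot]
            rng.shuffle(rest)
            action = [slot] + rest[:K - 1]
        else:         # Phase 2: UCB-based exploration with the index in (7)
            def ucb(i):
                n = len(obs[i])
                if n == 0:
                    return float("inf")
                return sum(obs[i]) / n + math.sqrt(3 * math.log(t - tau + 1) / (2 * n))
            action = sorted(range(L), key=ucb, reverse=True)[:K]
        clicked = False
        for item in action:  # cascade feedback: browse until the first click
            z = 1 if rng.random() < w_rows[t - 1][item] else 0
            obs[item].append(z)
            if z == 1:
                clicked = True
                break
        clicks += clicked
        # Phase 3: GLRT change-point detection and global restart
        if any(glr_alarm(obs[i], delta) for i in range(L)):
            obs = [[] for _ in range(L)]
            tau = t
    return clicks
```

The GLRT-CascadeKL-UCB variant would differ only in replacing the index inside `ucb` with the KL-UCB index of (8).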
4 Performance Analysis
The $T$-step regret of the proposed GLRT-CascadeUCB and GLRT-CascadeKL-UCB will be derived in this section. Upper bounds on the regret of GLRT-CascadeUCB and GLRT-CascadeKL-UCB are developed in Sections 4.1 and 4.2, respectively. Discussions of our theoretical guarantees are given in Section 4.3.
Without loss of generality, for the $i$th piecewise-stationary segment, the ground set is first sorted in decreasing order of attraction probabilities, that is, $w^{(i)}(1) \ge w^{(i)}(2) \ge \cdots \ge w^{(i)}(L)$. The optimal list on the $i$th segment is thus any permutation of the list $(1, \ldots, K)$. Item $\ell$ is optimal if $\ell \le K$; otherwise, item $\ell$ is suboptimal. To simplify the exposition, the gap between the attraction probabilities of a suboptimal item $\ell$ and an optimal item $k$ on the $i$th segment is defined as
$$\Delta_{\ell,k}^{(i)} = w^{(i)}(k) - w^{(i)}(\ell), \quad \ell > K, \ k \le K.$$
Similarly, the largest amplitude change among the items at change-point $\nu_i$ is defined as
$$b^{(i)} = \max_{\ell \in \mathcal{E}} \big| w^{(i+1)}(\ell) - w^{(i)}(\ell) \big|,$$
with $i \in \{1, \ldots, N-1\}$. We make the following assumption for the theoretical analysis.
For each change-point $i \in \{1, \ldots, N-1\}$, the lengths $\nu_i - \nu_{i-1}$ and $\nu_{i+1} - \nu_i$ of the neighboring piecewise-stationary segments are sufficiently large relative to $1/\big(b^{(i)}\big)^2$, up to factors logarithmic in $T$ and $1/\delta$.
The implication of Assumption 2 is discussed in Appendix A.4. Note that Assumption 2 is standard in the piecewise-stationary environment, and similar or identical assumptions are made in other change-detection based bandit algorithms (Liu et al., 2018; Cao et al., 2019; Besson and Kaufmann, 2019) as well. It requires the length of the piecewise-stationary segment between two change-points to be large enough. Assumption 2 guarantees that, with high probability, every change-point is detected within a short interval after it occurs, which is equivalent to saying that all change-points are detected correctly (low probability of false alarm) and quickly (low detection delay). This result is formally stated in Lemma 3. In our simulations, the proposed algorithms work well even when Assumption 2 does not hold.
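The gap and amplitude quantities defined above are mechanical to compute. The helpers below (function names are ours) evaluate them for toy segments; the gap computation sorts each segment's attraction probabilities in decreasing order, as in the text.

```python
def suboptimality_gaps(w_seg, K):
    """Gaps Delta_{l,k} = w(k) - w(l) between each suboptimal item l (rank > K)
    and each optimal item k (rank <= K) on one segment, after sorting the
    segment's attraction probabilities in decreasing order."""
    ws = sorted(w_seg, reverse=True)
    return {(l, k): ws[k - 1] - ws[l - 1]
            for l in range(K + 1, len(ws) + 1)
            for k in range(1, K + 1)}

def change_amplitude(w_prev, w_next):
    """Largest per-item change max_l |w_next(l) - w_prev(l)| at a change-point."""
    return max(abs(p - q) for p, q in zip(w_prev, w_next))
```

Assumption 2 then asks, informally, that the segments around change-point $i$ be long relative to `1 / change_amplitude(...)**2` for the corresponding pair of segments.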
4.1 Regret Upper Bound for GLRT-CascadeUCB
The upper bound on the regret of GLRT-CascadeUCB is given in the following theorem.
The theorem is proved in Appendix A.3. ∎
Theorem 1 indicates that the upper bound on the regret of GLRT-CascadeUCB is incurred by two types of costs that are further decomposed into four terms. Terms (a) and (b) upper bound the costs of UCB-based exploration and uniform exploration, respectively. Another type of cost is from change-point detection, where the costs incurred by detection delay and the incorrect detections are bounded by terms (c) and (d). Corollary 1 follows directly from Theorem 1.
Let $b_{\min}$ denote the smallest magnitude of any change-point on any item, and let $\Delta_{\min}$ be the smallest magnitude of a suboptimality gap on any of the stationary segments. The regret of GLRT-CascadeUCB is bounded as follows, depending on whether one has prior knowledge of $N$:
($N$ known): choosing $\alpha = \sqrt{\frac{N \log T}{T}}$ and $\delta = \frac{1}{T}$ gives $\mathcal{R}(T) = \tilde{\mathcal{O}}\big(\sqrt{NLT}\big)$.
($N$ unknown): choosing $\alpha = \sqrt{\frac{\log T}{T}}$ and $\delta = \frac{1}{T}$ gives $\mathcal{R}(T) = \tilde{\mathcal{O}}\big(N\sqrt{LT}\big)$.
Please refer to Appendix A.4 for proof. ∎
As a direct result of Theorem 1, the upper bounds on the regret of GLRT-CascadeUCB in Corollary 1 consist of two terms, where the first is incurred by the UCB-based exploration and the second comes from the change-point detection component. As $T$ becomes larger, the regret is dominated by the cost of the change-point detection component, implying the regret is $\tilde{\mathcal{O}}(\sqrt{NLT})$ or $\tilde{\mathcal{O}}(N\sqrt{LT})$. Similar phenomena can also be found in piecewise-stationary MAB (Liu et al., 2018; Cao et al., 2019; Besson and Kaufmann, 2019).
The proof outline of Theorem 1 is presented in the following; it is based on a recursive argument built on Lemmas 1 to 4. We start by upper bounding the regret in the stationary scenario, where $N = 1$ and no change-point occurs.
Under the stationary scenario ($N = 1$), the regret of GLRT-CascadeUCB is upper bounded as
where $\tau$ denotes the first detection time.
Proof is presented in Appendix A.1. ∎
Then we bound the false-alarm probability appearing in Lemma 1 under the aforementioned stationary scenario.
Under the stationary scenario, with confidence level $\delta$ for the Bernoulli GLRT, we have
Please refer to Appendix A.2 for proof. ∎
Next, we define the event that all change-points up to the $i$th one have been detected quickly and correctly:
Lemma 3 below shows that this event happens with high probability.
When the event holds, the GLRT with confidence level $\delta$ is capable of detecting the change-point correctly and quickly with high probability, that is,
where $\tau_i$ is the detection time of the $i$th change-point.
Please refer to Lemma 12 in Besson and Kaufmann (2019). ∎
In the next lemma, we bound the expected detection delay given that the good event holds.
The expected detection delay given the good event is:
By the definition of the good event, the conditional expected delay is immediately upper bounded by the length of the detection window. ∎
Thus, we can decompose the regret into good events, in which GLRT-CascadeUCB reinitializes the algorithm correctly and quickly after every change-point, and bad events, in which either large detection delays or false alarms happen. Notice that Lemmas 1 and 4 provide upper bounds on the regret of the stationary scenario and on the detection delays under good events, respectively. Lemmas 2 and 3 show that, with high probability, all change-points can be detected correctly and quickly, and thus lead to upper bounds on the regret under bad events. By summing up the regrets from good events and bad events, an upper bound on the regret of GLRT-CascadeUCB is then developed. Detailed steps can be found in Appendix A.
4.2 Regret Upper Bound for GLRT-CascadeKL-UCB
In this subsection, we develop the upper bound on the $T$-step regret of GLRT-CascadeKL-UCB.
Please refer to Appendix B.1 for proof. ∎
Similarly, the upper bound on the regret of GLRT-CascadeKL-UCB in Theorem 2 can be decomposed into four different terms, where (a) is incurred by the incorrect change-point detections, (b) is the cost of the uniform exploration, (c) is incurred by the change-point detection delay, and (d) is the cost of the KL-UCB based exploration.
The proof is very similar to that of Corollary 1. ∎
We sketch the proof of Theorem 2 as follows; the detailed proofs are presented in Appendix B. By defining the events corresponding to the algorithm performing uniform exploration and to all change-points being detected correctly and quickly, we can first bound the cost of uniform exploration and the cost of incorrect or slow detection of change-points, which can be done by applying proof techniques similar to those used for Lemmas 1 and 3. Then, we divide the regret into the different piecewise-stationary segments. By bounding the cost of the detection delays (using Lemma 4) and of the KL-UCB based exploration, the upper bound on the regret is established.
4.3 Discussion

Corollaries 1 and 2 reveal that, by properly choosing the confidence level $\delta$ and the uniform exploration probability $\alpha$, the regrets of GLRT-CascadeUCB and GLRT-CascadeKL-UCB can be upper bounded by (when $N$ is unknown)
$$\mathcal{R}(T) = \tilde{\mathcal{O}}\big(N\sqrt{LT}\big), \qquad (12)$$
where the $\tilde{\mathcal{O}}$ notation hides the gap terms and lower-order terms. Notice that the upper bound in (12) does not require knowledge of the number of piecewise-stationary segments $N$. On the other hand, if $N$ is known, a better upper bound can be achieved,
$$\mathcal{R}(T) = \tilde{\mathcal{O}}\big(\sqrt{NLT}\big), \qquad (13)$$
where the dependence on $N$ is improved to $\sqrt{N}$ compared with (12). Note that, compared to CUSUM in Liu et al. (2018) and SW in Cao et al. (2019), the GLRT has fewer tuning parameters and does not require the smallest magnitude among the change-points, as shown in Corollary 1. Moreover, the parameters $\alpha$ and $\delta$ follow simple rules, as shown in Corollary 1, while complicated parameter tuning steps are required by CUSUM and SW.
The upper bounds on the regret of GLRT-CascadeUCB and GLRT-CascadeKL-UCB improve over the state-of-the-art algorithms CascadeDUCB and CascadeSWUCB in Li and de Rijke (2019), either in the dependence on $T$ or in the dependence on both $T$ and $N$. In real-world applications, both $L$ and $T$ can be huge; for example, both are in the millions in web search, where these improvements are significant. Furthermore, compared with the lower bound in Li and de Rijke (2019), the proposed algorithms are nearly optimal in $T$ up to a poly-logarithmic factor.
5 Experiments

In this section, numerical experiments on both synthetic and real-world datasets are carried out to show the performance of the proposed algorithms relative to state-of-the-art ones. To be more specific, four baseline algorithms and two oracle algorithms are included in the experiments: CascadeUCB1 (Kveton et al., 2015) and CascadeKL-UCB (Kveton et al., 2015) are nearly optimal algorithms for CB in the stationary environment; CascadeDUCB (Li and de Rijke, 2019) and CascadeSWUCB (Li and de Rijke, 2019) are algorithms adopting the passively adaptive approach for piecewise-stationary CB; Oracle-CascadeUCB1 and Oracle-CascadeKL-UCB are oracle algorithms that know exactly when the change-points occur and are thus capable of restarting immediately after each change-point. The goal is to identify the $K$ most attractive items and maximize the expected number of clicks. The parameters of CascadeDUCB and CascadeSWUCB are chosen based on the theoretical analysis in Li and de Rijke (2019), and the parameters $\delta$ and $\alpha$ of GLRT-CascadeUCB and GLRT-CascadeKL-UCB are set following Corollary 1, for both the synthetic and real-world datasets.
5.1 Synthetic Dataset
In this experiment, the ground set contains $L$ items, of which $K$ are recommended. We consider a simulated piecewise-stationary environment set up as follows: 1) the expected attractions of the top-$K$ items remain constant over the whole time horizon; 2) in each even piecewise-stationary segment, three suboptimal items are chosen randomly and their expected attractions are set to new values; 3) in each odd piecewise-stationary segment, we reset the expected attractions to the initial state. All piecewise-stationary segments have the same length, so the horizon of $T$ steps is divided into $N$ equal segments. A detailed depiction of the piecewise-stationary environment can be found in Figure 2.
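The synthetic setup above can be sketched as follows. All sizes and probability values in this snippet are illustrative assumptions of ours, not the paper's exact configuration: even segments boost three randomly chosen suboptimal items, and odd segments reset to the base attraction vector.

```python
import random

def make_synthetic_env(L, K, num_segments, seg_len, base_opt=0.8, base_sub=0.2,
                       boost=0.9, num_boosted=3, seed=0):
    """Piecewise-stationary attraction matrix (T x L) mimicking Section 5.1.

    Segments are indexed from 1. In even segments, `num_boosted` randomly chosen
    suboptimal items have their attractions raised to `boost`; odd segments use
    the initial vector. The numeric values are illustrative, not the paper's.
    """
    rng = random.Random(seed)
    base = [base_opt] * K + [base_sub] * (L - K)  # top-K items stay constant
    rows = []
    for seg in range(1, num_segments + 1):
        w = list(base)
        if seg % 2 == 0:
            for item in rng.sample(range(K, L), num_boosted):
                w[item] = boost
        rows.extend([w] * seg_len)
    return rows

env = make_synthetic_env(L=10, K=3, num_segments=4, seg_len=100)
```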
We report the $T$-step cumulative regrets of all algorithms, averaged over Monte Carlo simulations, in Figure 3. Meanwhile, the means and standard deviations of the $T$-step regrets of all algorithms on the synthetic dataset are listed in Table 1. The results show that the proposed GLRT-CascadeUCB and GLRT-CascadeKL-UCB achieve better performance than the other algorithms and are very close to the oracle algorithms. Compared with the best existing algorithm, CascadeSWUCB, both GLRT-CascadeUCB and GLRT-CascadeKL-UCB achieve a substantial reduction of the cumulative regret, which is consistent with the empirical difference between passively adaptive and actively adaptive approaches observed in MAB. Notice that although CascadeDUCB appears to be adaptive to the change-points, its performance is even worse than that of algorithms designed for stationary CB. The possible reasons are twofold: 1) the theoretical results show that CascadeDUCB is worse than the other algorithms for piecewise-stationary CB by a poly-logarithmic factor in $T$; 2) the number of time steps $T$ is not long enough. It is worth mentioning that our experiment on this synthetic dataset violates Assumption 2, which would require much longer piecewise-stationary segments. Nevertheless, the proposed algorithms are capable of detecting all change-points correctly with high probability and sufficiently fast in our experiments, as shown in Table 3.
5.2 Yahoo! Dataset
In this subsection, we adopt the benchmark dataset for the evaluation of bandit algorithms published by Yahoo! (the Yahoo! Front Page Today Module User Click Log Dataset, available at https://webscope.sandbox.yahoo.com). This dataset, which uses binary values to indicate whether or not a click occurs, contains user click logs for news articles displayed in the Featured Tab of the Today Module on Yahoo! (Li et al., 2011), where each item corresponds to one article. We pre-process the dataset by adopting the same method as Cao et al. (2019). To make the experiment nontrivial, several modifications are applied to the dataset: 1) the click rate of each item is enlarged by a constant factor; 2) the time horizon is reduced, as shown in Figure 4. The cumulative regrets of all algorithms, averaged over Monte Carlo trials, are presented in Figure 5, which shows that the regrets of our proposed algorithms are only slightly above those of the oracle algorithms and significantly below those of the other algorithms. The means and standard deviations of the $T$-step regrets of all algorithms on the Yahoo! dataset are given in Table 1. Again, although Assumption 2 is not satisfied on the Yahoo! dataset, the GLRT-based algorithms detect the change-points correctly and quickly; the mean detection time of each change-point, with its standard deviation, is reported in Table 3.
6 Conclusion

Two new actively adaptive algorithms for piecewise-stationary cascading bandits, namely GLRT-CascadeUCB and GLRT-CascadeKL-UCB, are developed in this work. Under mild assumptions, it is analytically established that GLRT-CascadeUCB and GLRT-CascadeKL-UCB achieve the same nearly optimal regret upper bound on the order of $\tilde{\mathcal{O}}(\sqrt{NLT})$. Compared with state-of-the-art algorithms that adopt the passively adaptive approach, such as CascadeSWUCB and CascadeDUCB, our new regret upper bounds are sharper in their dependence on $T$. Numerical tests on both synthetic and real-world data show the improved efficiency of the proposed algorithms.
Several interesting questions are left open for future work. The current regret lower bound has no dependence on the size $L$ of the ground set or the length $K$ of the recommended list; taking them into account may lead to a tighter lower bound. Another challenging problem is whether the gap in $T$ between the regret upper bound and the lower bound can be closed. Finally, we will extend the single-click model to multiple-click models in future work.
- Auer et al. (2002a) Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256.
- Auer et al. (2002b) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77.
- Auer et al. (2019) Auer, P., Gajane, P., and Ortner, R. (2019). Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proc. 32nd Conf. on Learn. Theory (COLT’19), pages 138–158.
- Besbes et al. (2014) Besbes, O., Gur, Y., and Zeevi, A. (2014). Stochastic multi-armed-bandit problem with non-stationary rewards. In Proc. 24th Annu. Conf. Neural Inf. Process. Syst. (NeurIPS’14), pages 199–207.
- Besson and Kaufmann (2019) Besson, L. and Kaufmann, E. (2019). The generalized likelihood ratio test meets KL-UCB: An improved algorithm for piecewise non-stationary bandits. arXiv preprint arXiv:1902.01575.
- Cao et al. (2019) Cao, Y., Wen, Z., Kveton, B., and Xie, Y. (2019). Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In Proc. 22nd Int. Conf. Artif. Intell. Stat. (AISTATS 2019), pages 418–427.
- Cappé et al. (2013) Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., Stoltz, G., et al. (2013). Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Stat., 41(3):1516–1541.
- Cesa-Bianchi and Lugosi (2012) Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits. J. Comput. Syst. Sci., 78(5):1404–1422.
- Chen et al. (2016) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. (2016). Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. J. Mach. Learn. Res., 17(1):1746–1778.
- Cheung et al. (2019) Cheung, W. C., Tan, V., and Zhong, Z. (2019). A Thompson sampling algorithm for cascading bandits. In Proc. 22nd Int. Conf. Artif. Intell. Stat. (AISTATS 2019), pages 438–447.
- Combes et al. (2015) Combes, R., Shahi, M. S. T. M., Proutiere, A., et al. (2015). Combinatorial bandits revisited. In Proc. 29th Annu. Conf. Neural Inf. Process. Syst. (NeurIPS’15), pages 2116–2124.
- Craswell et al. (2008) Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. (2008). An experimental comparison of click position-bias models. In Proc. 1st ACM Int. Conf. Web Search Data Min. (WSDM’08), pages 87–94. ACM.
- Draglia et al. (1999) Draglia, V., Tartakovsky, A. G., and Veeravalli, V. V. (1999). Multihypothesis sequential probability ratio tests. I. Asymptotic optimality. IEEE Trans. Inf. Theory, 45(7):2448–2461.
- Dupret and Piwowarski (2008) Dupret, G. E. and Piwowarski, B. (2008). A user browsing model to predict search engine click data from past observations. In Proc. 31st Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR’08), pages 331–338. ACM.
- Garivier and Cappé (2011) Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proc. 24th Conf. on Learn. Theory (COLT’11), pages 359–376.
- Garivier and Moulines (2011) Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Proc. 22nd Int. Conf. Algorithmic Learning Theory (ALT’11), pages 174–188. Springer.
- Hadjiliadis and Moustakides (2006) Hadjiliadis, O. and Moustakides, V. (2006). Optimal and asymptotically optimal CUSUM rules for change point detection in the Brownian motion model with multiple alternatives. Theory Probab. Appl., 50(1):75–85.
- Hartland et al. (2007) Hartland, C., Baskiotis, N., Gelly, S., Sebag, M., and Teytaud, O. (2007). Change point detection and meta-bandits for online learning in dynamic environments. CAp, pages 237–250.
- Hinkley (1971) Hinkley, D. V. (1971). Inference about the change-point from cumulative sum tests. Biometrika, 58(3):509–523.
- Jagerman et al. (2019) Jagerman, R., Markov, I., and de Rijke, M. (2019). When people change their mind: Off-policy evaluation in non-stationary recommendation environments. In Proc. 12th ACM Int. Conf. Web Search Data Min. (WSDM’19), pages 447–455. ACM.
- Kaufmann and Koolen (2018) Kaufmann, E. and Koolen, W. (2018). Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint arXiv:1811.11419.
- Kocsis and Szepesvári (2006) Kocsis, L. and Szepesvári, C. (2006). Discounted UCB. In 2nd PASCAL Challenges Workshop, volume 2.
- Kveton et al. (2015) Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. (2015). Cascading bandits: Learning to rank in the cascade model. In Proc. 32nd Int. Conf. Mach. Learn. (ICML 2015), pages 767–776.
- Lai and Robbins (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22.
- Li et al. (2019) Li, B., Chen, T., and Giannakis, G. B. (2019). Bandit online learning with unknown delays. In Proc. 22nd Int. Conf. Artif. Intell. Stat. (AISTATS 2019), pages 993–1002.
- Li and de Rijke (2019) Li, C. and de Rijke, M. (2019). Cascading non-stationary bandits: Online learning to rank in the non-stationary cascade model. arXiv preprint arXiv:1905.12370.
- Li et al. (2011) Li, L., Chu, W., Langford, J., and Wang, X. (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. 4th ACM Int. Conf. Web Search Data Min. (WSDM’11), pages 297–306. ACM.
- Li et al. (2016) Li, S., Karatzoglou, A., and Gentile, C. (2016). Collaborative filtering bandits. In Proc. 39th Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR’16), pages 539–548. ACM.
- Littlestone and Warmuth (1994) Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Inf. Comput., 108(2):212–261.
- Liu et al. (2018) Liu, F., Lee, J., and Shroff, N. (2018). A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proc. 32nd AAAI Conf. Artif. Intell. (AAAI’18).
- Lorden et al. (1971) Lorden, G. et al. (1971). Procedures for reacting to a change in distribution. Ann. Math. Stat., 42(6):1897–1908.
- Mellor and Shapiro (2013) Mellor, J. and Shapiro, J. (2013). Thompson sampling in switching environments with Bayesian online change detection. In Proc. 16th Int. Conf. Artif. Intell. Stat. (AISTATS 2013), pages 442–450.
- Moustakides et al. (1986) Moustakides, G. V. et al. (1986). Optimal stopping times for detecting changes in distributions. Ann. Stat., 14(4):1379–1387.
- Page (1954) Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2):100–115.
- Pereira et al. (2018) Pereira, F. S., Gama, J., de Amo, S., and Oliveira, G. M. (2018). On analyzing user preference dynamics with temporal social networks. Mach. Learn., 107(11):1745–1773.
- Siegmund (2013) Siegmund, D. (2013). Sequential analysis: tests and confidence intervals. Springer Science & Business Media.
- Siegmund and Venkatraman (1995) Siegmund, D. and Venkatraman, E. (1995). Using the generalized likelihood ratio statistic for sequential detection of a change-point. Ann. Stat., pages 255–271.
- Wang et al. (2015) Wang, C., Liu, Y., Wang, M., Zhou, K., Nie, J.-y., and Ma, S. (2015). Incorporating non-sequential behavior into click models. In Proc. 38th Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR’15), pages 283–292. ACM.
- Wei and Srivastava (2018) Wei, L. and Srivastava, V. (2018). On abruptly-changing and slowly-varying multiarmed bandit problems. In Proc. Am. Contr. Conf. (ACC 2018), pages 6291–6296. IEEE.
- Willsky and Jones (1976) Willsky, A. and Jones, H. (1976). A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. Autom. Control, 21(1):108–112.
- Yu and Mannor (2009) Yu, J. Y. and Mannor, S. (2009). Piecewise-stationary bandit problems with side observations. In Proc. 26th Int. Conf. Mach. Learn. (ICML 2009), pages 1177–1184. ACM.
- Yue et al. (2010) Yue, Y., Gao, Y., Chapelle, O., Zhang, Y., and Joachims, T. (2010). Learning more powerful test statistics for click-based retrieval evaluation. In Proc. 33rd Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR’10), pages 507–514. ACM.
- Zhou et al. (2019) Zhou, H., Wang, L., Varshney, L. R., and Lim, E.-P. (2019). A near-optimal change-detection based algorithm for piecewise-stationary combinatorial semi-bandits. arXiv preprint arXiv:1908.10402.
- Zoghi et al. (2017) Zoghi, M., Tunys, T., Ghavamzadeh, M., Kveton, B., Szepesvari, C., and Wen, Z. (2017). Online learning to rank in stochastic click models. In Proc. 34th Int. Conf. Mach. Learn. (ICML 2017), pages 4199–4208.
Appendix A Detailed Proofs of Theorem 1
A.1 Proof of Lemma 1
Proof of Lemma 1.
Denote $R_t = r(\mathcal{I}^*, \mathbf{w}(t)) - r(\mathcal{I}_t, \mathbf{w}(t))$ as the regret of the learning algorithm at time $t$, where $\mathcal{I}_t$ is the recommended list at time $t$ and $\mathbf{w}(t)$ is the associated expected attraction vector at time $t$. By further denoting $\tau_1$ as the first change-point detection time of the Bernoulli GLRT, the regret of GLRT-CascadeUCB can be decomposed as:
where inequality (a) holds since $R_t \le 1$ for all $t$.
In order to bound term (b), denote by $\mathcal{E}_t$ the event that the algorithm is in the forced uniform exploration phase at time $t$, and let $\mathcal{F}_t$ be the event that $\hat{w}_i(t)$ is not in the high-probability confidence interval around $w_i$, where $w_i$ is the expected attraction of item $i$ in the first piecewise-stationary segment, $\hat{w}_i(t)$ is the sample mean of item $i$ up to time $t$, and $N_i(t)$ is the number of times that item $i$ is observed up to time $t$. Term (b) can be further decomposed as
where inequality (c) holds since the forced uniform exploration probability is $p$. Term (d) can be bounded by applying the Chernoff-Hoeffding inequality,
Furthermore, term (e) can be bounded as follows,
where the inequality follows from the proof of Theorem 2 in Kveton et al. (2015). Summing all terms completes the proof. ∎
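The Chernoff-Hoeffding step used above is easy to sanity-check numerically. The sketch below is illustrative only (the helper names are not from the paper): it compares a Monte Carlo estimate of the deviation probability of a Bernoulli sample mean against the two-sided bound $2\exp(-2n\epsilon^2)$.

```python
import math
import random

def hoeffding_bound(n, eps):
    """Two-sided Chernoff-Hoeffding bound for the sample mean of n i.i.d. [0,1] variables."""
    return 2 * math.exp(-2 * n * eps ** 2)

def empirical_deviation_prob(mu, n, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P(|sample mean - mu| >= eps) for Bernoulli(mu) samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.random() < mu for _ in range(n)) / n
        if abs(sample_mean - mu) >= eps:
            hits += 1
    return hits / trials
```

For instance, with $\mu = 0.5$, $n = 100$, and $\epsilon = 0.1$, the bound evaluates to $2e^{-2} \approx 0.27$, while the simulated deviation probability is noticeably smaller, as expected since the bound is distribution-free.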
A.2 Proof of Lemma 2
Proof of Lemma 2.
Define $\tau^{(i)}$ as the first change-point detection time of the $i$th item. Then, $\tau_1 = \min_{i \in [L]} \tau^{(i)}$. Since global restart is adopted, applying the union bound gives
Recall the GLR statistic defined in (4); plugging it in, we have that
where $\hat{\mu}_{s:s'}$ denotes the mean of the rewards generated from the distribution with expected reward $\mu_1$ from time step $s$ to $s'$. Inequality (a) is because of the fact that
inequality (b) follows from the union bound; inequality (c) follows from Lemma 10 in Besson and Kaufmann (2019); and inequality (d) holds due to a bound on the Riemann zeta function. This completes the proof. ∎
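For intuition about the quantity Lemma 2 controls, the Bernoulli GLR change-point statistic can be sketched in a few lines. The code below is an illustrative sketch (function names are not from the paper): for a 0/1 reward sequence it computes $\sup_{1 \le s < n} \big[ s\,\mathrm{kl}(\hat{\mu}_{1:s}, \hat{\mu}_{1:n}) + (n-s)\,\mathrm{kl}(\hat{\mu}_{s+1:n}, \hat{\mu}_{1:n}) \big]$, the statistic that is compared against the detection threshold; a change is declared once it exceeds that threshold.

```python
import math

def kl_bern(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q), with arguments clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_statistic(x):
    """Bernoulli GLR statistic: best split of x into two segments, scored by
    how much each segment's mean diverges from the overall mean."""
    n = len(x)
    total = sum(x)
    mu_all = total / n
    head = 0  # running sum of the first s observations
    best = 0.0
    for s in range(1, n):
        head += x[s - 1]
        mu1 = head / s
        mu2 = (total - head) / (n - s)
        stat = s * kl_bern(mu1, mu_all) + (n - s) * kl_bern(mu2, mu_all)
        best = max(best, stat)
    return best
```

On a stationary-looking sequence such as alternating 0s and 1s the statistic stays small, whereas a sequence whose mean jumps abruptly (e.g., twenty 0s followed by twenty 1s) produces a statistic far above typical thresholds, which is exactly the behavior the false-alarm and detection-delay bounds quantify.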
A.3 Proof of Theorem 1
Define good events $\mathcal{C}_i$ and $\mathcal{D}_i$ for each change-point $i$. Recall the definition in (11) of the good event that all the change-points up to the $i$th one have been detected correctly and quickly. Again, denote $R_t$ as the regret of the learning algorithm at time $t$. Decomposing the expected cumulative regret with respect to these good events, we have that