Be Aware of Non-Stationarity: Nearly Optimal Algorithms for Piecewise-Stationary Cascading Bandits
Abstract
Cascading bandit (CB) is a variant of both the multi-armed bandit (MAB) and the cascade model (CM), where a learning agent aims to maximize the total reward by recommending $K$ out of $L$ items to a user. We focus on a common real-world scenario where the user's preference can change in a piecewise-stationary manner. Two efficient algorithms, GLRT-CascadeUCB and GLRT-CascadeKLUCB, are developed. The key idea behind the proposed algorithms is to incorporate an almost parameter-free changepoint detector, the Generalized Likelihood Ratio Test (GLRT), within classical upper confidence bound (UCB) based algorithms. Gap-dependent regret upper bounds of the proposed algorithms are derived, and both match the lower bound up to a polylogarithmic factor in the number of time steps $T$. We also present numerical experiments on both synthetic and real-world datasets to show that GLRT-CascadeUCB and GLRT-CascadeKLUCB outperform state-of-the-art algorithms in the literature.
1 Introduction
Online recommendation (Li et al., 2016) and web search (Dupret and Piwowarski, 2008; Zoghi et al., 2017) are important in the modern economy. Based on a user's browsing history, these systems strive to maximize satisfaction and minimize regret by presenting the user with a list of items (e.g., web pages and advertisements) that meets her/his preference. We focus on the popular cascading bandit (CB) model (Kveton et al., 2015), which is a variant of both the multi-armed bandit (MAB) (Auer et al., 2002a) and the cascade model (CM) (Craswell et al., 2008). In the CB model, the learning agent aims to identify the $K$ most attractive items out of the $L$ total items contained in the ground set. At each time step, the learning agent recommends a ranked list of $K$ items and receives the reward and feedback from the user. The goal of the agent is to maximize the total reward.
Existing works on CB (Kveton et al., 2015; Cheung et al., 2019) and MAB (Lai and Robbins, 1985; Auer et al., 2002a; Li et al., 2019) can be categorized according to whether a stationary or non-stationary environment is studied. The stationary environment refers to the scenario where the reward distributions of arms (in MAB) or attraction distributions of items (in CB) do not evolve over time. On the other hand, a non-stationary environment is prevalent in real-world applications such as web search, online advertisement, and recommendation (Jagerman et al., 2019; Yu and Mannor, 2009; Pereira et al., 2018). If algorithms designed for stationarity are directly applied to a non-stationary environment, linear regret may occur (Li and de Rijke, 2019; Garivier and Moulines, 2011). Two types of non-stationary environment models have been proposed and studied in the literature. One is the adversarial environment (Auer et al., 2002b; Littlestone and Warmuth, 1994), whereas the other is the piecewise-stationary environment. The piecewise-stationary environment was introduced in prior works on MAB (Hartland et al., 2007; Kocsis and Szepesvári, 2006; Garivier and Moulines, 2011), where the user's preference remains stationary in certain time periods, named piecewise-stationary segments, but can shift abruptly at some unknown time steps, called changepoints. In this paper, we focus on the piecewise-stationary environment since it models real-world applications better. For instance, in recommendation systems, a user's preference for an item is unlikely to be invariant (stationary environment) or to change significantly at each time step (adversarial environment).
To address the piecewise-stationary MAB, two types of approaches have been proposed in the literature: passively adaptive approaches (Garivier and Moulines, 2011; Besbes et al., 2014; Wei and Srivatsva, 2018) and actively adaptive approaches (Cao et al., 2019; Liu et al., 2018; Besson and Kaufmann, 2019; Auer et al., 2019). The passively adaptive approaches make decisions based on the most recent observations and are unaware of when a changepoint occurs. For active adaptation, a changepoint detection algorithm such as CUSUM (Page, 1954), the Page Hinkley Test (PHT) (Hinkley, 1971), the Generalized Likelihood Ratio Test (GLRT) (Willsky and Jones, 1976), or Sliding Window (SW) (Cao et al., 2019) is included. It has been demonstrated in MAB that actively adaptive approaches outperform passively adaptive approaches via extensive numerical experiments (Mellor and Shapiro, 2013; Cao et al., 2019; Liu et al., 2018). However, for CB, only passively adaptive approaches have been studied in the literature (Li and de Rijke, 2019). This motivates us to develop actively adaptive approaches for piecewise-stationary CB. Specifically, our main contributions are summarized as follows.

Unlike previous passively adaptive algorithms, such as CascadeDUCB and CascadeSWUCB (Li and de Rijke, 2019), we propose two actively adaptive algorithms, GLRT-CascadeUCB and GLRT-CascadeKLUCB, by incorporating an efficient changepoint detection component, the GLRT, within the upper confidence bound (UCB) (Auer et al., 2002a) and Kullback–Leibler UCB (KL-UCB) (Garivier and Cappé, 2011; Cappé et al., 2013) algorithms. The GLRT is almost parameter-free as compared to the changepoint detection methods used in the previous non-stationary bandit literature (Liu et al., 2018; Cao et al., 2019).

We derive gap-dependent upper bounds on the regret of the proposed GLRT-CascadeUCB and GLRT-CascadeKLUCB. When the number of piecewise-stationary segments $N$ is known, a regret of $\mathcal{O}(\sqrt{NLT \log T})$ is established for both algorithms, where $L$ is the number of items and $T$ is the number of time steps. When $N$ is unknown, the regret is $\mathcal{O}(N\sqrt{LT \log T})$ for both algorithms. Compared to the best existing passively adaptive algorithm, CascadeSWUCB (Li and de Rijke, 2019), the proposed algorithms improve the dependence on $L$ and $T$ in the regret bound.

The efficiency of the proposed GLRT-CascadeUCB and GLRT-CascadeKLUCB relative to other state-of-the-art algorithms is demonstrated on both synthetic and real-world datasets.
Compared to recent works on piecewise-stationary MAB (Besson and Kaufmann, 2019) and combinatorial MAB (CMAB) (Zhou et al., 2019) that adopt the GLRT as the changepoint detector, the problem setting considered in this paper is different. In MAB, only one selected item rather than a list of items is allowed at each time step. Notice that although CMAB (Combes et al., 2015; Cesa-Bianchi and Lugosi, 2012; Chen et al., 2016) also allows a list of items at each time step, it assumes full feedback on all selected items under the semi-bandit setting. Yet in CB, the learning agent can only observe partial feedback, which will become clearer later. Furthermore, we develop the analysis of both UCB-based and KL-UCB based algorithms for CB, whereas only one of them (either the UCB-based or the KL-UCB based algorithm) is analyzed in Besson and Kaufmann (2019) and Zhou et al. (2019).
The remainder of the paper is organized as follows. We describe the problem formulation in Section 2. The proposed algorithms, GLRT-CascadeUCB and GLRT-CascadeKLUCB, are explained in detail in Section 3. We prove upper bounds on the regret of the proposed algorithms in Section 4 and present the numerical experiments in Section 5. Finally, we conclude the paper in Section 6.
2 Problem Formulation
2.1 Cascade Model
Before introducing the piecewise-stationary CB, we first briefly review the cascade model in this subsection.
The CM (Craswell et al., 2008) is a prevalent model for explaining a user's behavior (e.g., click data) in web search and online advertising. In the CM, the ground set that contains all $L$ items (e.g., all web pages or advertisements) is denoted as $\mathcal{E} = \{1, \ldots, L\}$. At each time step, the user is presented with a ranked list of $K$ items $A \in \Pi_K(\mathcal{E})$ by the learning agent, where $\Pi_K(\mathcal{E})$ is the set of all permutations of $K$ distinct items from the ground set. The user browses the list from the first item in order and clicks the first item that attracts her/him. If the user is attracted by the $k$-th item, the user will click on it and will not browse the remaining items (multi-click cases (Wang et al., 2015; Yue et al., 2010) are beyond the scope of this paper). Otherwise, if the user is not attracted by the $k$-th item, the user will browse the $(k+1)$-th item, and so on until the last item in the list. During browsing, item $i$ attracts the user with probability $w(i)$ after the user browses it. We further pose a reasonable assumption on $w$ as follows.
Assumption 1.
The attraction probability $w(i)$ of item $i$ is independent of the other items, where $w \in [0,1]^L$ is the associated attraction probability vector of the ground set $\mathcal{E}$.
After the user clicks on the $k$-th item in the list, the index of the click position is observed by the learning agent and used to learn the user's preference. Note that upon receiving this feedback, the agent can determine that items $1, \ldots, k-1$ in the list were browsed but not attractive, the $k$-th item was browsed and attractive, and items $k+1, \ldots, K$ were left unobserved by the user.
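The click-position semantics above can be made concrete with a small simulation. This is an illustrative sketch rather than the paper's code: `cascade_feedback` and `observed_items` are hypothetical helper names, and the user model is the standard cascade assumption of independent Bernoulli attractions.

```python
import random

def cascade_feedback(ranked_list, attraction_probs, rng=random.Random(0)):
    """Simulate one round of the cascade model: the user scans the list
    top-down and clicks the first attractive item; items after the click
    are never browsed.  Returns the click position (0-based) or None."""
    for pos, item in enumerate(ranked_list):
        if rng.random() < attraction_probs[item]:
            return pos  # clicked; the remaining items stay unobserved
    return None

def observed_items(ranked_list, click_pos):
    """Items whose attraction is revealed by the feedback: everything up
    to and including the click, or the whole list if no click occurred."""
    if click_pos is None:
        return list(ranked_list)
    return list(ranked_list[: click_pos + 1])
```

For example, a click at position 1 reveals that the item in position 0 was browsed but not attractive, while the items after position 1 remain unobserved.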
2.2 PiecewiseStationary Cascading Bandit Problem
A piecewise-stationary CB is characterized by the ground set $\mathcal{E}$, a sequence of $T$ time steps, the list size $K$, and a sequence of attraction distributions $\{P_t\}_{t=1}^{T}$. The attraction of item $i$ at time $t$ is modeled as a Bernoulli random variable $Z_t(i)$, with $Z_t = (Z_t(1), \ldots, Z_t(L))$ containing the attractions of all items in the ground set. In our notational convention, $Z_t(i) = 1$ indicates that item $i$ is attractive to the user, and the pmf of $Z_t$ is $P_t$. In a piecewise-stationary CB, $P_t$ changes across time in a piecewise-stationary manner. Clearly, the $P_t$ are parameterized by the attraction probability vectors $w_t \in [0,1]^L$. In addition, we have $\mathbb{E}[Z_t(i)] = w_t(i)$ for all $i \in \mathcal{E}$.
To formally define the piecewise-stationary environment, the number of piecewise-stationary segments is defined as

$$N = 1 + \sum_{t=1}^{T-1} \mathbb{1}\left\{ \exists i \in \mathcal{E} : w_t(i) \neq w_{t+1}(i) \right\}, \qquad (1)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function. Note that when a changepoint occurs, at least one item changes its attraction distribution; hence, asynchronous attraction distribution changes are allowed. By the definition in (1), the number of changepoints is $N-1$, and the changepoints are denoted as $\nu_1 < \nu_2 < \cdots < \nu_{N-1}$. Specifically, $\nu_0 = 0$ and $\nu_N = T$ are defined for consistency. For each piecewise-stationary segment $k \in \{1, \ldots, N\}$, $P^{(k)}$ and $w^{(k)}(i)$ are adopted to denote the attraction distribution and the expected attraction of item $i$ on the $k$-th piecewise-stationary segment, respectively, where $w^{(k)}$ is the vector that contains the expected attractions of all items in the $k$-th segment.
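The segment count in (1) is straightforward to compute when the attraction probability vectors are laid out as a $T \times L$ array. A minimal sketch (the function names are ours, not the paper's):

```python
def num_segments(w):
    """Number of piecewise-stationary segments in a T x L matrix of
    attraction probabilities: one plus the number of time steps at which
    at least one item's probability changes, as in Eq. (1)."""
    T = len(w)
    return 1 + sum(1 for t in range(T - 1) if w[t] != w[t + 1])

def changepoints(w):
    """Time steps t (1-based) such that w_t != w_{t+1}; per-item changes
    occurring at the same t count as a single changepoint."""
    return [t + 1 for t in range(len(w) - 1) if w[t] != w[t + 1]]
```

Note that a simultaneous (asynchronous) change of several items at the same step still contributes only one changepoint, matching the definition.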
Recommendation proceeds as follows. At time $t$, the agent recommends a list of $K$ items $A_t = (a_1^t, \ldots, a_K^t)$, where the list is decided based on the feedback from the user up to time $t-1$. Here, the user's feedback at time $t$ is the position of the clicked item, $F_t = \min\{1 \leq k \leq K : Z_t(a_k^t) = 1\}$, with the convention $\min \emptyset = \infty$ when no item is clicked.
The reward at time $t$ can be written as

$$r\left(A_t, Z_t\right) = 1 - \prod_{k=1}^{K} \left(1 - Z_t\left(a_k^t\right)\right). \qquad (2)$$
The agent's goal is to maximize the cumulative reward across the $T$ time steps. Equivalently, the agent's policy is evaluated by its expected cumulative regret:

$$R(T) = \sum_{t=1}^{T} \mathbb{E}\left[ r\left(A_t^*, Z_t\right) - r\left(A_t, Z_t\right) \right], \qquad (3)$$
where $A_t^* = \arg\max_{A \in \Pi_K(\mathcal{E})} f(A, w_t)$ with $f(A, w) = 1 - \prod_{k=1}^{K} (1 - w(a_k))$ is the optimal list that maximizes the expected reward at time $t$, and the expectation is taken with respect to the $Z_t$s and the selection of the $A_t$s. Under this setting, the optimal list is the one that maximizes the probability that at least one item in the recommended list is attractive, which is equivalent to recommending the $K$ most attractive items at time $t$. Since the reward defined in (2) is invariant to permutations of $A_t$, there are $K!$ optimal lists at each time $t$. Note that $A_t^*$ remains the same up to a permutation during a piecewise-stationary segment unless a changepoint occurs.
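The expected reward and the optimal list described above can be sketched as follows. The helper names are ours; the product form makes the permutation-invariance of the reward immediately visible.

```python
def expected_reward(item_list, w):
    """Probability that at least one item in the list attracts the user:
    f(A, w) = 1 - prod_{a in A} (1 - w(a))."""
    p_no_click = 1.0
    for item in item_list:
        p_no_click *= 1.0 - w[item]
    return 1.0 - p_no_click

def optimal_list(w, K):
    """The K items with the largest attraction probabilities maximize the
    expected reward; any permutation of them is optimal."""
    return sorted(range(len(w)), key=lambda i: w[i], reverse=True)[:K]
```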
2.3 Generalized Likelihood Ratio Test for Bernoulli Distribution
Sequential changepoint detection is of fundamental importance in statistical sequential analysis; however, most existing algorithms place additional assumptions on both the pre-change and post-change distributions (Hadjiliadis and Moustakides, 2006; Siegmund, 2013; Draglia et al., 1999; Siegmund and Venkatraman, 1995) or even require both the pre-change and post-change distributions to be known (Lorden et al., 1971; Moustakides et al., 1986). These approaches are not applicable to CB, since the distributions are unknown to the agent and must be learned. In general, with the pre-change and post-change distributions unknown, developing algorithms with provable guarantees is challenging. Several approaches, however, have recently appeared in the literature (Liu et al., 2018; Cao et al., 2019; Besson and Kaufmann, 2019). Here we adopt the GLRT (Besson and Kaufmann, 2019) (see Algorithm 1). Compared to other existing changepoint detection methods with provable guarantees, the advantages of the GLRT are twofold: 1) fewer tuning parameters: the only required parameter for the GLRT is the confidence level $\delta$ of the changepoint detection, while CUSUM (Liu et al., 2018) and SW (Cao et al., 2019) have three and two parameters to be manually tuned, respectively; 2) less required prior knowledge: whereas both CUSUM and SW require the smallest magnitude among the changepoints as prior knowledge, the GLRT does not.
Next, we consider the GLRT. Suppose we have a sequence of Bernoulli random variables $X_1, \ldots, X_n$ and aim to determine, as soon as possible, whether a changepoint exists. Under the Bernoulli distribution, this problem can be formulated as a parametric sequential test of the following two hypotheses:

$$\mathcal{H}_0: \exists \mu_0 : X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\mu_0) \quad \text{versus} \quad \mathcal{H}_1: \exists \mu_0 \neq \mu_1, \, \exists s : X_1, \ldots, X_s \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\mu_0) \ \text{and} \ X_{s+1}, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\mu_1),$$
where $\mathrm{Bern}(\mu)$ is the Bernoulli distribution with mean $\mu$. The Bernoulli GLR statistic is defined as

$$\mathrm{GLR}(n) = \sup_{1 \leq s < n} \left[ s \, \mathrm{kl}\left(\hat{\mu}_{1:s}, \hat{\mu}_{1:n}\right) + (n-s) \, \mathrm{kl}\left(\hat{\mu}_{s+1:n}, \hat{\mu}_{1:n}\right) \right], \qquad (4)$$
where $\hat{\mu}_{s:s'}$ is the empirical mean of the observations $X_s, \ldots, X_{s'}$, and $\mathrm{kl}(x, y)$ is the Kullback–Leibler (KL) divergence between two Bernoulli distributions,

$$\mathrm{kl}(x, y) = x \log\frac{x}{y} + (1-x) \log\frac{1-x}{1-y}.$$
The detection time of the Bernoulli GLRT changepoint detector for a length-$n$ sequence with threshold $\beta(n, \delta)$ is

$$\tau_\delta = \inf\left\{ n \in \mathbb{N} : \mathrm{GLR}(n) \geq \beta(n, \delta) \right\},$$

where

$$\beta(n, \delta) = 2\,\mathcal{T}\!\left(\frac{\log\left(3 n \sqrt{n} / \delta\right)}{2}\right) + 6 \log(1 + \log n), \qquad (5)$$

and $\mathcal{T}(\cdot)$ has the same definition as that in (13) of Kaufmann and Koolen (2018). To better understand the performance of the GLRT, it is instructive to use an example.
Example 1 (Efficiency of GLRT).
Consider a sequence of Bernoulli random variables in which the first portion of the samples is generated from Bern(0.2) and the remaining ones are generated from Bern(0.8), as shown in Figure 1. With an appropriately chosen confidence level $\delta$, the GLRT detects the changepoint with a small average delay over 100 Monte Carlo trials.
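The detector can be sketched as follows. This is a simplified stand-in for Algorithm 1: it implements the GLR statistic of (4) with the Bernoulli KL divergence, but replaces the threshold $\beta(n, \delta)$ of (5) with a fixed constant supplied by the caller, so the numbers below are only indicative.

```python
import math

def kl(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q), clipped to keep the logs finite."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_statistic(x):
    """Bernoulli GLR statistic of Eq. (4): the best split s weighs the KL
    divergences of the two halves' empirical means from the pooled mean."""
    n = len(x)
    total = sum(x)
    mu_all = total / n
    best, head = 0.0, 0
    for s in range(1, n):
        head += x[s - 1]
        mu1 = head / s
        mu2 = (total - head) / (n - s)
        best = max(best, s * kl(mu1, mu_all) + (n - s) * kl(mu2, mu_all))
    return best

def glrt_detect(x, threshold):
    """Smallest n at which the running GLR statistic exceeds the threshold
    (a constant stand-in for beta(n, delta)); None if no change is flagged."""
    for n in range(2, len(x) + 1):
        if glr_statistic(x[:n]) > threshold:
            return n
    return None
```

On a noiseless version of the example (50 zeros followed by ones), the statistic stays at zero before the change and crosses a moderate threshold within a few post-change samples.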
3 Algorithms
The proposed algorithms, GLRT-CascadeUCB and GLRT-CascadeKLUCB, are presented in Algorithm 2; they are motivated by Kveton et al. (2015) and Besson and Kaufmann (2019). Here, we denote the last detection time by $\tau$. The number of observations of the $i$-th item after $\tau$ and its sample mean are denoted by $n_i$ and $\hat{\mu}_i$, respectively. Three phases comprise the proposed algorithms.
Phase 1: Forced uniform exploration to ensure that sufficient samples are gathered for all items to perform the Bernoulli GLRT detection (Algorithm 1).
Phase 2: UCBbased exploration (UCB or KLUCB) to learn the optimal list on each piecewisestationary segment.
Phase 3: Bernoulli GLRT changepoint detection (Algorithm 1) to monitor if global restart should be triggered.
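The three phases can be sketched as a single loop. This is a heavily simplified, hypothetical rendering of Algorithm 2: the forced-exploration schedule is one plausible choice, and the user model (`env`), the detector (`detect_change`), and the index computation (`ucb_index`) are caller-supplied stand-ins rather than the paper's exact lines.

```python
import math
import random

def glrt_cascade_ucb(T, L, K, alpha, delta, env, detect_change, ucb_index):
    """Skeleton of the three-phase loop; yields the recommended list A_t."""
    rng = random.Random(0)
    tau = 0                                # last detection time
    n = [0] * L                            # observations per item since tau
    s = [0.0] * L                          # summed attractions since tau
    history = [[] for _ in range(L)]       # per-item observations since tau
    for t in range(1, T + 1):
        slot = (t - tau) % math.floor(L / alpha)
        if slot < L:                       # Phase 1: forced uniform exploration
            a = slot
            rest = rng.sample([i for i in range(L) if i != a], K - 1)
            A = [a] + rest                 # scheduled item leads the list
        else:                              # Phase 2: UCB-based exploration
            A = sorted(range(L),
                       key=lambda i: ucb_index(s, n, i, t - tau),
                       reverse=True)[:K]
        click = env(t, A)                  # click position, or None
        observed = A if click is None else A[: click + 1]
        for pos, item in enumerate(observed):
            z = 1 if click == pos else 0
            n[item] += 1; s[item] += z; history[item].append(z)
            if detect_change(history[item], delta):  # Phase 3: GLRT restart
                tau = t
                n = [0] * L; s = [0.0] * L
                history = [[] for _ in range(L)]
                break
        yield A
```

The global restart (resetting all statistics once any item's detector fires) mirrors the fact that a changepoint may alter several items' attraction distributions at once.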
The proposed algorithms only require the number of time steps $T$, the ground set $\mathcal{E}$, the number of items $K$ in the list, the uniform exploration probability $\alpha$, and the confidence level $\delta$ as inputs. The choices of $\alpha$ and $\delta$ will be discussed in Section 4, but here we want to emphasize that $\delta$ is the only parameter needed by the GLRT, whereas $\alpha$ relates to uniform exploration in bandit problems and also appears in other algorithms (Liu et al., 2018; Cao et al., 2019).
We now discuss the proposed algorithms in detail. The algorithm determines whether to perform uniform exploration or UCB-based exploration depending on whether the condition in line 3 of Algorithm 2 is satisfied, which ensures that the fraction of time steps spent in the uniform exploration phase is about $\alpha$. If uniform exploration is triggered, the first position in the recommended list is given to the item scheduled for forced exploration, and the remaining $K-1$ items in the list are chosen uniformly at random (line 4), which ensures the scheduled item will be observed by the user. If UCB-based exploration is adopted at time $t$, the algorithms choose the $K$ items with the largest UCB indices (line 6),
$$A_t = \left(a_1^t, \ldots, a_K^t\right), \quad \text{where } a_1^t, \ldots, a_K^t \text{ are the } K \text{ items with the largest UCB indices}. \qquad (6)$$
By recommending the list $A_t$ and observing the user's feedback (line 8), we update the statistics (line 10) and run the Bernoulli GLRT detector (line 11). If the Bernoulli GLRT detector returns True, we reset $n_i$ and $\hat{\mu}_i$ for all $i \in \mathcal{E}$ and set the last detection time $\tau$ to the current time step (line 12). Finally, the UCB indices of the items are computed as follows (line 17),
$$\mathrm{UCB}_i(t) = \hat{\mu}_i + \sqrt{\frac{1.5 \log(t - \tau)}{n_i}}, \qquad (7)$$

$$\mathrm{UCB}_i(t) = \max\left\{ q \in \left[\hat{\mu}_i, 1\right] : n_i \, \mathrm{kl}\left(\hat{\mu}_i, q\right) \leq \log(t - \tau) + 3 \log\log(t - \tau) \right\}, \qquad (8)$$

where $t - \tau$ is the number of time steps since the last restart and $\mathrm{kl}(\cdot, \cdot)$ is the Bernoulli KL divergence defined in Section 2.3. Notice that (7) gives the UCB indices of GLRT-CascadeUCB, and (8) gives the UCB indices of GLRT-CascadeKLUCB. For the intuition behind these indices, we refer the readers to the proof of Theorem 1 in Auer et al. (2002a) and the proof of Theorem 2 in Cappé et al. (2013).
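The two index computations can be sketched as follows. The constant in the UCB radius and the exploration function in the KL-UCB constraint are assumptions patterned on Auer et al. (2002a) and Garivier and Cappé (2011) and may differ from the paper's exact expressions; the KL-UCB index has no closed form, so it is found by bisection.

```python
import math

def ucb_index(mu_hat, n_obs, t):
    """UCB1-style index (sketch; the exact confidence radius constant
    may differ from the paper's)."""
    return mu_hat + math.sqrt(1.5 * math.log(t) / n_obs)

def klucb_index(mu_hat, n_obs, t, tol=1e-6):
    """KL-UCB index: the largest q >= mu_hat such that
    n_obs * kl(mu_hat, q) <= log(t), computed by bisection."""
    def kl(p, q, eps=1e-12):
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    target = math.log(t) / n_obs
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```

A practical difference worth noting: the KL-UCB index never exceeds 1, whereas the UCB1 index can, which is one reason KL-UCB indices are tighter for Bernoulli rewards.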
4 Performance Analysis
The $T$-step regret of the proposed GLRT-CascadeUCB and GLRT-CascadeKLUCB is derived in this section. Upper bounds on the regret of GLRT-CascadeUCB and GLRT-CascadeKLUCB are developed in Sections 4.1 and 4.2, respectively. Discussions of our theoretical guarantees are in Section 4.3.
Without loss of generality, for the $k$-th piecewise-stationary segment, the ground set is first sorted in decreasing order of attraction probability, that is, $w^{(k)}(1) \geq w^{(k)}(2) \geq \cdots \geq w^{(k)}(L)$. The optimal list on the $k$-th segment is thus any permutation of the list $(1, \ldots, K)$. Item $i$ is optimal if $1 \leq i \leq K$; otherwise, item $i$ is suboptimal. To simplify the exposition, the gap between the attraction probabilities of a suboptimal item $i$ and an optimal item $j$ on the $k$-th segment is defined as:

$$\Delta_{i,j}^{(k)} = w^{(k)}(j) - w^{(k)}(i).$$
Similarly, the largest amplitude change among the items at changepoint $\nu_k$ is defined as

$$b_k = \max_{i \in \mathcal{E}} \left| w^{(k+1)}(i) - w^{(k)}(i) \right|.$$

We have the following assumption for the theoretical analysis.
Assumption 2.
Define $d_k$ as the detection delay bound of the $k$-th changepoint, which depends on $\alpha$, $\delta$, and the change magnitude $b_k$, and assume $\nu_k - \nu_{k-1} \geq 2 \max\{d_k, d_{k-1}\}$ for all $k \in \{1, \ldots, N-1\}$, with $d_0 = 0$.
The implication of Assumption 2 is that every segment is long enough for the previous changepoint to be detected before the next one occurs, as one can find in Appendix A.4. Note that Assumption 2 is standard in the piecewise-stationary environment, and similar or identical assumptions are made in other change-detection based bandit algorithms (Liu et al., 2018; Cao et al., 2019; Besson and Kaufmann, 2019) as well. It requires the length of the piecewise-stationary segment between two changepoints to be large enough. Assumption 2 guarantees that with high probability all the changepoints are detected within a bounded delay, which is equivalent to saying all changepoints are detected correctly (low probability of false alarm) and quickly (low detection delay). This result is formally stated in Lemma 3. In our simulations, the proposed algorithms work well even when Assumption 2 does not hold.
4.1 Regret Upper Bound for GLRTCascadeUCB
The upper bound on the regret of GLRT-CascadeUCB is as follows.
Theorem 1.
Proof.
The theorem is proved in Appendix A.3. ∎
Theorem 1 indicates that the upper bound on the regret of GLRT-CascadeUCB is incurred by two types of costs, which are further decomposed into four terms. Terms (a) and (b) upper bound the costs of the UCB-based exploration and the uniform exploration, respectively. The other type of cost comes from changepoint detection, where the costs incurred by the detection delays and by incorrect detections are bounded by terms (c) and (d), respectively. Corollary 1 follows directly from Theorem 1.
Corollary 1.
Let $b_{\min}$ denote the smallest magnitude of any changepoint on any item, and let $\Delta_{\min}$ be the smallest suboptimality gap on any of the stationary segments. The regret of GLRT-CascadeUCB is established depending on whether one has prior knowledge of $N$:


($N$ known): Choosing $\alpha$ and $\delta$ appropriately gives

$$R(T) = \mathcal{O}\left(\sqrt{NLT \log T}\right). \qquad (9)$$
($N$ unknown): Choosing $\alpha$ and $\delta$ appropriately gives

$$R(T) = \mathcal{O}\left(N \sqrt{LT \log T}\right). \qquad (10)$$
Proof.
Please refer to Appendix A.4 for proof. ∎
As a direct result of Theorem 1, the upper bounds on the regret of GLRT-CascadeUCB in Corollary 1 consist of two terms, where the first is incurred by the UCB-based exploration and the second comes from the changepoint detection component. As $N$ becomes larger, the regret is dominated by the cost of the changepoint detection component, implying the regret is $\mathcal{O}(\sqrt{NLT \log T})$ or $\mathcal{O}(N\sqrt{LT \log T})$ for known and unknown $N$, respectively. Similar phenomena can also be found in piecewise-stationary MAB (Liu et al., 2018; Cao et al., 2019; Besson and Kaufmann, 2019).
The proof outline of Theorem 1 is presented in the following; it is based on a recursive argument using Lemmas 1 to 4. We start by upper bounding the regret under the stationary scenario with $N = 1$, $\nu_0 = 0$, and $\nu_1 = T$.
Lemma 1.
Under the stationary scenario ($N = 1$), the regret of GLRT-CascadeUCB is upper bounded in terms of the first detection time $\tau_1$.
Proof.
Proof is presented in Appendix A.1. ∎
Then we bound the false alarm probability appearing in Lemma 1 under the aforementioned stationary scenario.
Lemma 2.
Consider the stationary scenario with confidence level $\delta$ for the Bernoulli GLRT; then the probability of raising a false alarm within the horizon is controlled by $\delta$.
Proof.
Please refer to Appendix A.2 for proof. ∎
Next, we define the event that all the changepoints up to the $k$-th one have been detected quickly and correctly:

$$\mathcal{C}_k = \left\{ \forall j \leq k, \ \nu_j < \tau_j \leq \nu_j + d_j \right\}. \qquad (11)$$
Lemma 3 below shows that this event happens with high probability.
Lemma 3.
When the event in (11) holds for the first $k-1$ changepoints, the GLRT with confidence level $\delta$ is capable of detecting the $k$-th changepoint correctly and quickly with high probability, where $\tau_k$ denotes the detection time of the $k$-th changepoint.
Proof.
Please refer to Lemma 12 in Besson and Kaufmann (2019). ∎
In the next lemma, we bound the expected detection delay when the good event holds.
Lemma 4.
The expected detection delay of the $k$-th changepoint, given the good event, is upper bounded by the delay bound $d_k$ from Assumption 2.
Proof.
By the definition of the good event, the conditional expected delay is upper bounded by the corresponding delay bound. ∎
Thus, we can decompose the regret into good events, in which GLRT-CascadeUCB re-initializes correctly and quickly after each changepoint, and bad events, in which either large detection delays or false alarms happen. Notice that Lemmas 1 and 4 provide upper bounds on the regret of the stationary scenario and on the detection delays under the good events, respectively. Lemmas 2 and 3 show that, with high probability, all changepoints are detected correctly and quickly, which leads to upper bounds for the bad events. Summing the regrets from the good and bad events yields an upper bound on the regret of GLRT-CascadeUCB. Detailed steps can be found in Appendix A.
4.2 Regret Upper Bound for GLRTCascadeKLUCB
In this subsection, we develop the upper bound on the $T$-step regret of GLRT-CascadeKLUCB.
Theorem 2.
Proof.
Please refer to Appendix B.1 for proof. ∎
Similarly, the upper bound on the regret of GLRT-CascadeKLUCB in Theorem 2 can be decomposed into four different terms, where (a) is incurred by the incorrect changepoint detections, (b) is the cost of the uniform exploration, (c) is incurred by the changepoint detection delays, and (d) is the cost of the KL-UCB based exploration.
Corollary 2.
Proof.
The proof is very similar to that of Corollary 1. ∎
We sketch the proof of Theorem 2 as follows; the detailed proofs are presented in Appendix B. By defining the events that the algorithm performs uniform exploration and that the changepoints are detected correctly and quickly, we first bound the cost of the uniform exploration and the cost of incorrect or slow detection of the changepoints, which can be realized by applying proof techniques similar to those of Lemmas 1 and 3. Then, we divide the regret into the different piecewise-stationary segments. By bounding the cost of the detection delays (using Lemma 4) and of the KL-UCB based exploration, the upper bound on the regret is established.
4.3 Discussion
Corollaries 1 and 2 reveal that by properly choosing the confidence level $\delta$ and the uniform exploration probability $\alpha$, the regrets of GLRT-CascadeUCB and GLRT-CascadeKLUCB can be upper bounded by (when $N$ is unknown)

$$R(T) = \tilde{\mathcal{O}}\left(N \sqrt{LT \log T}\right), \qquad (12)$$

where the $\tilde{\mathcal{O}}$ notation hides the gap terms and lower-order terms. Notice that the upper bound in (12) does not require knowledge of the number of piecewise-stationary segments $N$. On the other hand, if $N$ is known, a better upper bound can be achieved,

$$R(T) = \tilde{\mathcal{O}}\left(\sqrt{NLT \log T}\right), \qquad (13)$$

where the dependence on $N$ is improved to $\sqrt{N}$ compared with (12). Note that, compared to CUSUM in Liu et al. (2018) and SW in Cao et al. (2019), the GLRT has fewer tuning parameters and does not require the smallest magnitude among the changepoints, as shown in Corollary 1. Moreover, the parameters $\alpha$ and $\delta$ follow simple rules as shown in Corollary 1, while complicated parameter tuning steps are required in CUSUM and SW.
The upper bounds on the regret of GLRT-CascadeUCB and GLRT-CascadeKLUCB improve over the state-of-the-art algorithms CascadeDUCB and CascadeSWUCB in Li and de Rijke (2019) in their dependence on $L$ and $T$. In real-world applications, both $L$ and $T$ can be huge; for example, they are in the millions in web search, where the improvements are significant. Furthermore, compared with the lower bound in Li and de Rijke (2019), the proposed two algorithms are nearly optimal, up to a polylogarithmic factor in $T$.
5 Experiments
In this section, numerical experiments on both synthetic and real-world datasets are carried out to show the performance of the proposed algorithms relative to state-of-the-art ones. To be more specific, four baseline algorithms and two oracle algorithms are included in the experiments: CascadeUCB1 (Kveton et al., 2015) and CascadeKLUCB (Kveton et al., 2015) are nearly optimal algorithms for CB under the stationary environment; CascadeDUCB (Li and de Rijke, 2019) and CascadeSWUCB (Li and de Rijke, 2019) are passively adaptive algorithms for piecewise-stationary CB; OracleCascadeUCB1 and OracleCascadeKLUCB are oracle algorithms that know exactly when the changepoints occur and are thus capable of restarting immediately after each changepoint. The goal is to identify the $K$ most attractive items and maximize the expected number of clicks. The parameters of CascadeDUCB and CascadeSWUCB are chosen based on the theoretical analysis in Li and de Rijke (2019). For GLRT-CascadeUCB and GLRT-CascadeKLUCB, we set $\delta$ and $\alpha$ following Corollary 1 for both the synthetic and real-world datasets.
5.1 Synthetic Dataset
In this experiment, we consider a simulated piecewise-stationary environment set up as follows: 1) the expected attractions of the top $K$ items remain constant over the whole time horizon; 2) in each even piecewise-stationary segment, three suboptimal items are chosen randomly and their expected attractions are raised; 3) in each odd piecewise-stationary segment, we reset the expected attractions to the initial state. We set the length of each piecewise-stationary segment to be 2500 and choose $N = 10$, for a total of $T = 25000$ steps. A detailed depiction of the piecewise-stationary environment can be found in Figure 2.
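The synthetic environment can be sketched as follows. The ground set size, list size, base attraction probabilities, and boosted attraction value below are placeholders of our own choosing (the paper's exact values are not recoverable from this text); only the segment length and segment count follow the setup above.

```python
import random

# Hypothetical sizes and probabilities; only SEG_LEN and N_SEG follow the text.
L, K = 10, 3
SEG_LEN, N_SEG = 2500, 10
BASE = [0.8, 0.7, 0.6] + [0.2] * (L - K)   # top-K attractions stay constant

def build_environment(seed=0):
    """Piecewise-stationary attraction probabilities: odd segments use the
    base vector; in even segments, three randomly chosen suboptimal items
    get a boosted attraction (0.9 here, an assumed placeholder value)."""
    rng = random.Random(seed)
    w = []
    for seg in range(N_SEG):
        probs = list(BASE)
        if seg % 2 == 1:  # "even" segments in the paper's 1-based numbering
            for item in rng.sample(range(K, L), 3):
                probs[item] = 0.9
        w.extend([probs] * SEG_LEN)
    return w
```

With this construction the changepoints fall at multiples of the segment length, matching the detection times reported in Table 3.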
We report the $T$-step cumulative regrets of all the algorithms, averaged over Monte Carlo simulations, in Figure 3. Meanwhile, the means and standard deviations of the $T$-step regrets of all algorithms on the synthetic dataset are listed in Table 1. The results show that the proposed GLRT-CascadeUCB and GLRT-CascadeKLUCB achieve better performance than the other algorithms and are very close to the oracle algorithms. Compared with the best existing algorithm, CascadeSWUCB, both GLRT-CascadeUCB and GLRT-CascadeKLUCB achieve a significant reduction of the cumulative regret, which is consistent with the empirical gap between passively adaptive and actively adaptive approaches in MAB. Notice that although CascadeDUCB seems to be adaptive to the changepoints, its performance is even worse than that of the algorithms designed for stationary CB. The possible reasons are twofold: 1) the theoretical results show that CascadeDUCB is worse than the other algorithms for piecewise-stationary CB by a logarithmic factor; 2) the number of time steps $T$ is not long enough. It is worth mentioning that our experiment on this synthetic dataset violates Assumption 2, which would require far longer piecewise-stationary segments. Surprisingly, the proposed algorithms are capable of detecting all the changepoints correctly with high probability and sufficiently fast in our experiments, as shown in Table 3.
5.2 Yahoo! Dataset
In this subsection, we adopt the benchmark dataset for the evaluation of bandit algorithms published by Yahoo! (the Yahoo! Front Page Today Module User Click Log Dataset, available at https://webscope.sandbox.yahoo.com). This dataset, which uses binary values to indicate whether or not there is a click, contains the user click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! (Li et al., 2011), where each item corresponds to one article. We preprocess the dataset by adopting the same method as Cao et al. (2019). To make the experiment nontrivial, two modifications are applied to the dataset: 1) the click rate of each item is enlarged by a constant factor; 2) the time horizon is reduced, as shown in Figure 4. The cumulative regrets of all algorithms, averaged over Monte Carlo trials, are presented in Figure 5, which shows that the regrets of our proposed algorithms are only slightly above those of the oracle algorithms and significantly below those of the other algorithms. The means and standard deviations of the $T$-step regrets of all algorithms on the Yahoo! dataset are in Table 1. Again, although Assumption 2 is not satisfied on the Yahoo! dataset, the GLRT based algorithms detect the changepoints correctly and quickly; the detailed mean detection time of each changepoint, with its standard deviation, is in Table 3.
Table 1: Means and standard deviations of the $T$-step regrets of CascadeUCB1, CascadeKLUCB, CascadeDUCB, CascadeSWUCB, GLRT-CascadeUCB, GLRT-CascadeKLUCB, OracleCascadeUCB1, and OracleCascadeKLUCB on the synthetic dataset and the Yahoo! experiment. (Numerical entries omitted.)
Table 3: Mean detection time (with standard deviation) of each changepoint for GLRT-CascadeUCB and GLRT-CascadeKLUCB. Synthetic dataset changepoints: 2500, 5000, 7500, 10000, 12500, 15000, 17500, 20000, 22500. Yahoo! dataset changepoints: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000. (Numerical entries omitted.)
6 Conclusion
Two new actively adaptive algorithms for piecewise-stationary cascading bandits, namely GLRT-CascadeUCB and GLRT-CascadeKLUCB, are developed in this work. Under mild assumptions, it is analytically established that GLRT-CascadeUCB and GLRT-CascadeKLUCB achieve the same nearly optimal regret upper bound on the order of $\mathcal{O}(\sqrt{NLT \log T})$. Compared with state-of-the-art algorithms that adopt a passively adaptive approach, such as CascadeSWUCB and CascadeDUCB, our new regret upper bounds are tighter in their dependence on $L$ and $T$. Numerical tests on both synthetic and real-world data show the improved efficiency of the proposed algorithms.
Several interesting questions are left open for future work. The current regret lower bound has no dependency on the size $L$ of the ground set or the size $K$ of the list; accounting for them may lead to a tighter lower bound. Another challenging problem is whether the gap in $T$ between the regret upper bound and the lower bound can be closed. Finally, we will extend the single-click model to multiple-click models in future work.
References
 Auer et al. (2002a) Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256.
 Auer et al. (2002b) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77.
 Auer et al. (2019) Auer, P., Gajane, P., and Ortner, R. (2019). Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proc. 32nd Conf. on Learn. Theory (COLT’19), pages 138–158.
 Besbes et al. (2014) Besbes, O., Gur, Y., and Zeevi, A. (2014). Stochastic multi-armed-bandit problem with non-stationary rewards. In Proc. 24th Annu. Conf. Neural Inf. Process. Syst. (NeurIPS'14), pages 199–207.
 Besson and Kaufmann (2019) Besson, L. and Kaufmann, E. (2019). The generalized likelihood ratio test meets klUCB: an improved algorithm for piecewise non-stationary bandits. arXiv preprint arXiv:1902.01575.
 Cao et al. (2019) Cao, Y., Wen, Z., Kveton, B., and Xie, Y. (2019). Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In Proc. 22nd Int. Conf. Artif. Intell. Stat. (AISTATS 2019), pages 418–427.
 Cappé et al. (2013) Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., Stoltz, G., et al. (2013). Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Stat., 41(3):1516–1541.
 Cesa-Bianchi and Lugosi (2012) Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits. J. Comput. Syst. Sci., 78(5):1404–1422.
 Chen et al. (2016) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. (2016). Combinatorial multiarmed bandit and its extension to probabilistically triggered arms. J. Mach. Learn. Res., 17(1):1746–1778.
 Cheung et al. (2019) Cheung, W. C., Tan, V., and Zhong, Z. (2019). A Thompson sampling algorithm for cascading bandits. In Proc. 22nd Int. Conf. Artif. Intell. Stat. (AISTATS 2019), pages 438–447.
 Combes et al. (2015) Combes, R., Shahi, M. S. T. M., Proutiere, A., et al. (2015). Combinatorial bandits revisited. In Proc. 29th Annu. Conf. Neural Inf. Process. Syst. (NeurIPS’15), pages 2116–2124.
Craswell et al. (2008) Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. (2008). An experimental comparison of click position-bias models. In Proc. 1st ACM Int. Conf. Web Search Data Min. (WSDM’08), pages 87–94. ACM.
Dragalin et al. (1999) Dragalin, V., Tartakovsky, A. G., and Veeravalli, V. V. (1999). Multihypothesis sequential probability ratio tests. I. Asymptotic optimality. IEEE Trans. Inf. Theory, 45(7):2448–2461.
 Dupret and Piwowarski (2008) Dupret, G. E. and Piwowarski, B. (2008). A user browsing model to predict search engine click data from past observations. In Proc. 31st Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR’08), pages 331–338. ACM.
Garivier and Cappé (2011) Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proc. 24th Conf. on Learn. Theory (COLT’11), pages 359–376.
Garivier and Moulines (2011) Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Proc. 22nd Int. Conf. Algorithmic Learning Theory (ALT’11), pages 174–188. Springer.
Hadjiliadis and Moustakides (2006) Hadjiliadis, O. and Moustakides, V. (2006). Optimal and asymptotically optimal CUSUM rules for change point detection in the Brownian motion model with multiple alternatives. Theory Probab. Appl., 50(1):75–85.
Hartland et al. (2007) Hartland, C., Baskiotis, N., Gelly, S., Sebag, M., and Teytaud, O. (2007). Change point detection and meta-bandits for online learning in dynamic environments. CAp, pages 237–250.
Hinkley (1971) Hinkley, D. V. (1971). Inference about the change-point from cumulative sum tests. Biometrika, 58(3):509–523.
Jagerman et al. (2019) Jagerman, R., Markov, I., and de Rijke, M. (2019). When people change their mind: Off-policy evaluation in non-stationary recommendation environments. In Proc. 12th ACM Int. Conf. Web Search Data Min. (WSDM’19), pages 447–455. ACM.
 Kaufmann and Koolen (2018) Kaufmann, E. and Koolen, W. (2018). Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint arXiv:1811.11419.
Kocsis and Szepesvári (2006) Kocsis, L. and Szepesvári, C. (2006). Discounted UCB. In 2nd PASCAL Challenges Workshop, volume 2.
Kveton et al. (2015) Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. (2015). Cascading bandits: Learning to rank in the cascade model. In Proc. 32nd Int. Conf. Mach. Learn. (ICML 2015), pages 767–776.
 Lai and Robbins (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22.
Li et al. (2019) Li, B., Chen, T., and Giannakis, G. B. (2019). Bandit online learning with unknown delays. In Proc. 22nd Int. Conf. Artif. Intell. Stat. (AISTATS 2019), pages 993–1002.
Li and de Rijke (2019) Li, C. and de Rijke, M. (2019). Cascading non-stationary bandits: Online learning to rank in the non-stationary cascade model. arXiv preprint arXiv:1905.12370.
Li et al. (2011) Li, L., Chu, W., Langford, J., and Wang, X. (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. 4th ACM Int. Conf. Web Search Data Min. (WSDM’11), pages 297–306. ACM.
 Li et al. (2016) Li, S., Karatzoglou, A., and Gentile, C. (2016). Collaborative filtering bandits. In Proc. 39th Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR’16), pages 539–548. ACM.
 Littlestone and Warmuth (1994) Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Inf. Comput., 108(2):212–261.
Liu et al. (2018) Liu, F., Lee, J., and Shroff, N. (2018). A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proc. 32nd AAAI Conf. Artif. Intell. (AAAI’18).
Lorden (1971) Lorden, G. (1971). Procedures for reacting to a change in distribution. Ann. Math. Stat., 42(6):1897–1908.
Mellor and Shapiro (2013) Mellor, J. and Shapiro, J. (2013). Thompson sampling in switching environments with Bayesian online change detection. In Proc. 16th Int. Conf. Artif. Intell. Stat. (AISTATS 2013), pages 442–450.
Moustakides (1986) Moustakides, G. V. (1986). Optimal stopping times for detecting changes in distributions. Ann. Stat., 14(4):1379–1387.
 Page (1954) Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2):100–115.
 Pereira et al. (2018) Pereira, F. S., Gama, J., de Amo, S., and Oliveira, G. M. (2018). On analyzing user preference dynamics with temporal social networks. Mach. Learn., 107(11):1745–1773.
 Siegmund (2013) Siegmund, D. (2013). Sequential analysis: tests and confidence intervals. Springer Science & Business Media.
Siegmund and Venkatraman (1995) Siegmund, D. and Venkatraman, E. (1995). Using the generalized likelihood ratio statistic for sequential detection of a change-point. Ann. Stat., pages 255–271.
Wang et al. (2015) Wang, C., Liu, Y., Wang, M., Zhou, K., Nie, J.-Y., and Ma, S. (2015). Incorporating non-sequential behavior into click models. In Proc. 38th Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR’15), pages 283–292. ACM.
Wei and Srivastava (2018) Wei, L. and Srivastava, V. (2018). On abruptly-changing and slowly-varying multiarmed bandit problems. In Proc. Am. Contr. Conf. (ACC 2018), pages 6291–6296. IEEE.
 Willsky and Jones (1976) Willsky, A. and Jones, H. (1976). A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. Autom. Control, 21(1):108–112.
Yu and Mannor (2009) Yu, J. Y. and Mannor, S. (2009). Piecewise-stationary bandit problems with side observations. In Proc. 26th Int. Conf. Mach. Learn. (ICML 2009), pages 1177–1184. ACM.
Yue et al. (2010) Yue, Y., Gao, Y., Chapelle, O., Zhang, Y., and Joachims, T. (2010). Learning more powerful test statistics for click-based retrieval evaluation. In Proc. 33rd Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR’10), pages 507–514. ACM.
Zhou et al. (2019) Zhou, H., Wang, L., Varshney, L. R., and Lim, E.-P. (2019). A near-optimal change-detection based algorithm for piecewise-stationary combinatorial semi-bandits. arXiv preprint arXiv:1908.10402.
 Zoghi et al. (2017) Zoghi, M., Tunys, T., Ghavamzadeh, M., Kveton, B., Szepesvari, C., and Wen, Z. (2017). Online learning to rank in stochastic click models. In Proc. 34th Int. Conf. Mach. Learn. (ICML 2017), pages 4199–4208.
Appendices
Appendix A Detailed Proofs of Theorem 1
A.1 Proof of Lemma 1
Proof of Lemma 1.
Denote by $R_t$ the regret of the learning algorithm at time $t$, where $\mathcal{A}_t$ is the recommended list at time $t$ and $\mathbf{w}_t$ is the associated expected attraction vector at time $t$. Further denoting by $\tau_1$ the first change-point detection time of the Bernoulli GLRT, the regret of GLRT-CascadeUCB can be decomposed as:
where inequality (a) holds due to the fact that and .
In order to bound term (b), we denote by $E_t$ the event that the algorithm is in the forced uniform exploration phase, and let $F_t$ be the event that $\hat{w}_{\ell,t}$ is not in the high-probability confidence interval around $w_\ell$, where $w_\ell$ is the expected attraction of item $\ell$ in the first piecewise-stationary segment, $\hat{w}_{\ell,t}$ is the sample mean of item $\ell$ up to time $t$, and $N_{\ell,t}$ is the number of times that item $\ell$ is observed up to time $t$. Term (b) can be further decomposed as
where inequality (c) follows from the choice of the uniform exploration probability. Term (d) can be bounded by applying the Chernoff–Hoeffding inequality,
Furthermore, term (e) can be bounded as follows,
where the inequality follows from the proof of Theorem 2 in Kveton et al. (2015). Summing all the terms proves the result. ∎
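To make the Chernoff–Hoeffding step concrete, the following sketch numerically checks the two-sided Hoeffding bound $2\exp(-2n\epsilon^2)$ against a Monte Carlo estimate of the deviation probability for Bernoulli samples. This is only an illustration: the constants `n`, `eps`, and the Bernoulli mean `mu` are chosen for the demo and are not taken from the paper.

```python
import math
import random

def hoeffding_bound(n, eps):
    """Two-sided Chernoff-Hoeffding bound: P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2)."""
    return 2.0 * math.exp(-2.0 * n * eps * eps)

# Monte Carlo estimate of the deviation probability for Bernoulli(mu) samples.
random.seed(0)
n, eps, mu, trials = 100, 0.1, 0.3, 2000
deviations = 0
for _ in range(trials):
    mean = sum(random.random() < mu for _ in range(n)) / n
    deviations += abs(mean - mu) >= eps
empirical = deviations / trials
# The empirical frequency should fall below the bound (= 2 e^{-2} ~ 0.2707 here).
```

The bound is loose at these parameters (the true deviation probability is roughly an order of magnitude smaller), which is consistent with Hoeffding being distribution-free.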
A.2 Proof of Lemma 2
Proof of Lemma 2.
Define $\tau_\ell$ as the first change-point detection time of the $\ell$-th item, so that $\tau_1 = \min_{\ell} \tau_\ell$. Since the global restart is adopted, by applying the union bound, we have that
Recalling the GLR statistic defined in (4) and plugging it into the probability above, we have that
where $\hat{\mu}_{s:s'}$ denotes the sample mean of the rewards generated from the distribution with expected reward $\mu_1$ from time step $s$ to $s'$. Inequality (a) follows from the fact that
inequality (b) follows from the union bound; inequality (c) follows from Lemma 10 in Besson and Kaufmann (2019); and inequality (d) holds due to a bound in terms of the Riemann zeta function. The claimed bound then follows. ∎
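For intuition about the quantity being analyzed, the Bernoulli GLR statistic scans every split of the observed stream into two segments and measures, via the binary KL divergence, how much better a two-mean model explains the data than a single mean; a change is declared once the statistic crosses a threshold. The following is a minimal sketch, not the paper's implementation: the fixed threshold `6.0` in the usage below is illustrative (the paper's threshold depends on the confidence level and the number of samples), and all function names are hypothetical.

```python
import math

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with clipping for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_statistic(xs):
    """Bernoulli GLR statistic: maximize over all splits of xs into two segments."""
    n = len(xs)
    prefix = [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
    mu_all = prefix[n] / n
    best = 0.0
    for s in range(1, n):
        mu1 = prefix[s] / s                    # mean of first segment
        mu2 = (prefix[n] - prefix[s]) / (n - s)  # mean of second segment
        stat = s * bern_kl(mu1, mu_all) + (n - s) * bern_kl(mu2, mu_all)
        best = max(best, stat)
    return best

def glrt_detect(xs, threshold):
    """Return the first time the GLR statistic exceeds the threshold, or None."""
    for n in range(2, len(xs) + 1):
        if glr_statistic(xs[:n]) > threshold:
            return n
    return None
```

On a stream whose mean drops abruptly (e.g., thirty 1s followed by 0s), the statistic grows quickly after the change, while on a stationary stream it stays small, which is the behavior the false-alarm and detection-delay bounds in this lemma quantify.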
A.3 Proof of Theorem 1
Proof.
Define the good events as in (11): the good event up to the $i$-th change-point is that all the change-points up to the $i$-th one have been detected correctly and quickly. Again, we denote by $R_t$ the regret of the learning algorithm at time $t$. By first decomposing the expected cumulative regret with respect to these good events, we have that
where inequality (a) holds because the corresponding term can be bounded using Lemma 2, and inequality (b) holds due to Lemma 1. To bound term (c), we apply the law of total expectation and obtain
where