A Near-Optimal Change-Detection Based Algorithm for Piecewise-Stationary Combinatorial Semi-Bandits
Abstract
We investigate the piecewise-stationary combinatorial semi-bandit problem. Compared to the original combinatorial semi-bandit problem, our setting assumes the reward distributions of base arms may change in a piecewise-stationary manner at unknown time steps. We propose an algorithm, GLR-CUCB, which combines an efficient combinatorial semi-bandit algorithm, CUCB, with an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT). Our analysis shows that the regret of GLR-CUCB is upper bounded by $\mathcal{O}(\sqrt{NKT\log T})$, where $N$ is the number of piecewise-stationary segments, $K$ is the number of base arms, and $T$ is the number of time steps. As a complement, we also derive a nearly matching regret lower bound on the order of $\Omega(\sqrt{NKT})$, for both piecewise-stationary multi-armed bandits and combinatorial semi-bandits, using information-theoretic techniques and judiciously constructed piecewise-stationary bandit instances. Our lower bound is tighter than the best available regret lower bound, which is $\Omega(\sqrt{T})$. Numerical experiments on both synthetic and real-world datasets demonstrate the superiority of GLR-CUCB compared to other state-of-the-art algorithms.
1 Introduction
The multi-armed bandit (MAB) problem, first proposed by thompson1933likelihood, has been studied extensively in the statistics and machine learning communities, as it models many online decision-making problems such as online recommendation (li2016collaborative), computational advertising (tang2013automatic), and crowdsourcing task allocation (hassan2014multi). The classical MAB is modeled as an agent repeatedly pulling one of $K$ arms and observing the reward generated, with the goal of minimizing the regret, which is the difference between the reward of the optimal arm in hindsight and the reward of the arm chosen by the agent. This classical problem is well understood for both stochastic (lai1985asymptotically) and adversarial settings (auer2002nonstochastic). In the stochastic setting, the reward of each arm is generated from a fixed distribution, and it is well known that the problem-dependent lower bound is of order $\Omega(\log T)$ (lai1985asymptotically), where $T$ is the number of time steps. Several algorithms have been proposed and proven to achieve $\mathcal{O}(\log T)$ regret (agrawal2012analysis; auer2002finite). In the adversarial setting, the environment generates the reward at each time step in an adversarial manner, and the minimax regret lower bound is of order $\Omega(\sqrt{T})$. Several algorithms achieving order-optimal (up to poly-logarithmic factors) regret have been proposed in recent years (hazan2011better; bubeck2018sparsity; li2019bandit).
Many realworld applications, however, have a combinatorial nature that cannot be fully characterized by the classical MAB model. For example, online movie sites aim to recommend multiple movies to the users to maximize their utility under some constraints (e.g. recommend at most one movie for each category). This phenomenon motivates the study of combinatorial semibandits (CMAB), which aims to identify the best superarm, a set of base arms with highest aggregated reward. Several algorithms for stochastic CMAB with provable guarantees based on optimism principle have been proposed recently (chen2013combinatorial; kveton2015tight; combes2015combinatorial). All of these algorithms use some oracle to overcome the curse of dimension of the action space for solving some combinatorial optimization problem at each iteration. In addition, these algorithms achieve the optimal regret upper bound which is , where is an instancedependent parameter. Adversarial CMAB is also well studied. Several algorithms achieve optimal regret of order based on either FollowtheRegularizedLeader or FollowthePerturbedLeader, both of which are general frameworks for adversarial online learning algorithm design (hazan2016introduction). Moreover, one recent study has developed an algorithm that is orderoptimal for both stochastic and adversarial CMAB (zimmert2019beating).
Although both stochastic and adversarial CMAB are wellstudied, understanding of the scenario lying in the “middle” of these two settings is still limited. Such a “middle” setting where the reward distributions of base arms slowly change over time may be a more realistic model in many applications. For instance, in online recommendation systems, users’ preference are unlikely to be either timeinvariant or to change significantly and frequently over time. Thus in this case it would be too ideal to assume the stochastic CMAB model and too conservative to assume the adversarial CMAB model. Similar situations appear in web search, online advertisement, and crowdsourcing (yu2009piecewise; pereira2018analyzing; vempaty2014reliable). As such, we investigate a setting lying between these two standard CMAB models, namely piecewisestationary combinatorial semibandit, which we will define formally in Section 2. Piecewisestationary CMAB is a natural generalization of the piecewisestationary MAB model (hartland2007change; kocsis2006discounted; garivier2011upper), and can be interpreted as an approximation to the slowvarying CMAB problem. Roughly, compared to the stochastic CMAB, we assume reward distributions of base arms remain fixed for certain time periods called piecewisestationary segments, but can change abruptly at some unknown time steps, called changepoints.
Previous works on piecewisestationary MAB may be divided into two categories: passively adaptive approach (garivier2011upper; besbes2014stochastic; wei2018abruptly) and actively adaptive approach (cao2019nearly; liu2018change; besson2019generalized; auer2019adaptively). Passively adaptive approaches make decisions based on the most recent observations and are unaware of the underlying distribution changes. On the contrary, actively adaptive approaches incorporate a changepoint detector subroutine to monitor the reward distributions, and restart the algorithm once a changepoint is detected. Numerous empirical experiments have shown that actively adaptive approaches outperform passively adaptive approaches (mellor2013thompson), which motivates us to adopt an actively adaptive approach.
Our main contributions include the following:

We propose a simple and general algorithm for piecewise-stationary CMAB, named GLR-CUCB, which is based on CUCB (chen2013combinatorial) with a novel change-point detector, the generalized likelihood ratio test (GLRT) (besson2019generalized). The advantage of GLRT change-point detection is that it is almost parameter-free and thus easier to tune than previously proposed change-point detection methods used in nonstationary MAB, such as CUSUM (liu2018change) and SW (cao2019nearly).

For any combinatorial action set, we derive the problemdependent regret bound for GLRCUCB under mild conditions (see Section 4). When the number of change points is known beforehand, the regret of GLRCUCB is upper bounded by (nearly orderoptimal within polylogarithm factor in ), where is number of base arms. When is unknown, the algorithm achieves . Here, and are problemdependent constants which do not depend on , , or .

We derive a tighter minimax lower bound for both piecewise-stationary MAB and piecewise-stationary CMAB on the order of $\Omega(\sqrt{NKT})$. Since piecewise-stationary MAB is a special instance of piecewise-stationary CMAB in which every super arm is a single arm, any minimax lower bound that holds for piecewise-stationary MAB also holds for piecewise-stationary CMAB. To the best of our knowledge, this is the best existing minimax lower bound for piecewise-stationary CMAB. Previously, the best available lower bound was $\Omega(\sqrt{T})$ (garivier2011upper), which does not depend on $N$ or $K$.

We demonstrate that GLR-CUCB performs significantly better than state-of-the-art algorithms through experiments on both synthetic and real-world datasets.
The remainder of this paper is organized as follows: the formal problem formulation and some preliminaries are introduced in Section 2, then the proposed GLRCUCB algorithm in Section 3. We derive the upper bound on the regret of our algorithm in Section 4, and the minimax regret lower bound in Section 5. Section 6 gives our experiment results. Finally, we conclude the paper. Due to the page limitation, we postpone proofs and additional experimental results to the appendix.
2 Problem Formulation and Background
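In this section, we start with the formal definition of the piecewise-stationary combinatorial semi-bandit as well as some technical assumptions in Section 2.1. Then, we introduce the GLR change-point detector used in our algorithm design, and its advantages, in Section 2.2.
2.1 Piecewise-Stationary Combinatorial Semi-Bandits
A piecewise-stationary combinatorial semi-bandit is characterized by a tuple $(\mathcal{K}, \mathcal{F}, \mathcal{T}, \{f_{k,t}\}_{k\in\mathcal{K}, t\in\mathcal{T}}, r_{\boldsymbol{\mu}_t}(\cdot))$. Here, $\mathcal{K}=\{1,\ldots,K\}$ is the set of $K$ base arms; $\mathcal{F}\subseteq 2^{\mathcal{K}}$ is the set of all super arms; $\mathcal{T}=\{1,\ldots,T\}$ is a sequence of $T$ time steps; $f_{k,t}$ is the reward distribution of arm $k$ at time $t$ with mean $\mu_{k,t}$ and bounded support within $[0,1]$; $r_{\boldsymbol{\mu}_t}(\cdot):\mathcal{F}\to\mathbb{R}$ is the expected reward function defined on the super arm and the mean vector $\boldsymbol{\mu}_t=[\mu_{1,t},\ldots,\mu_{K,t}]^\top$ of all base arms at time $t$. Like chen2013combinatorial, we assume the expected reward function satisfies the following two properties: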
Assumption 2.1 (Monotonicity).
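Given two arbitrary mean vectors $\boldsymbol{\mu}$ and $\boldsymbol{\mu}'$, if $\mu_k\ge\mu'_k$ for all $k\in\mathcal{K}$, then $r_{\boldsymbol{\mu}}(S)\ge r_{\boldsymbol{\mu}'}(S)$ for all $S\in\mathcal{F}$.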
Assumption 2.2 (Lipschitz).
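Given two arbitrary mean vectors $\boldsymbol{\mu}$ and $\boldsymbol{\mu}'$, there exists an $L<\infty$ such that $|r_{\boldsymbol{\mu}}(S)-r_{\boldsymbol{\mu}'}(S)|\le L\,\|\mathcal{P}_S(\boldsymbol{\mu}-\boldsymbol{\mu}')\|_2$ for all $S\in\mathcal{F}$, where $\mathcal{P}_S(\cdot)$ is the projection operator specified as $\mathcal{P}_S(\boldsymbol{\mu})=[\mu_1\mathbb{1}\{1\in S\},\ldots,\mu_K\mathbb{1}\{K\in S\}]^\top$ in terms of the indicator function $\mathbb{1}\{\cdot\}$.
In the piecewise i.i.d. model, we define $N$, the number of piecewise-stationary segments in the reward process, to be
$$N = 1 + \sum_{t=1}^{T-1} \mathbb{1}\{f_{k,t}\neq f_{k,t+1} \text{ for some } k\in\mathcal{K}\}.$$
We denote the $N-1$ change-points as $\nu_1,\ldots,\nu_{N-1}$ respectively, and we let $\nu_0=0$ and $\nu_N=T$. For each piecewise-stationary segment $t\in[\nu_{i-1}+1,\nu_i]$, we use $f_k^i$ and $\mu_k^i$ to denote the reward distribution and the expected reward of arm $k$ on the $i$th piecewise-stationary segment, respectively. The vector encoding the expected rewards of all base arms on the $i$th segment is denoted as $\boldsymbol{\mu}^i=[\mu_1^i,\ldots,\mu_K^i]^\top$, $i=1,\ldots,N$. Note that when a change-point occurs, the reward distribution of at least one arm must have changed; however, the reward distributions of all base arms do not necessarily change.
For a piecewise-stationary combinatorial semi-bandit problem, at each time step $t$, the learning agent chooses a super arm $S_t\in\mathcal{F}$ to play based on the rewards observed up to time $t$. When the agent plays a super arm $S_t$, the rewards of the base arms contained in $S_t$ are revealed to the agent, along with the reward of the super arm itself. We assume that the agent has access to an $\alpha$-approximation oracle to carry out combinatorial optimization, defined as follows.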
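Assumption 2.3 ($\alpha$-approximation oracle).
Given a mean vector $\boldsymbol{\mu}$, the $\alpha$-approximation oracle outputs an $\alpha$-suboptimal super arm $S$ such that $r_{\boldsymbol{\mu}}(S)\ge\alpha\max_{S'\in\mathcal{F}} r_{\boldsymbol{\mu}}(S')$.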
Remark 2.1.
The approximation oracle assumption was first proposed in chen2013combinatorial for the combinatorial semi-bandit setting. This assumption is reasonable since many NP-hard combinatorial problems admit approximation algorithms that run in polynomial time (ausiello1995approximate). There are also many combinatorial problems which are not NP-hard and can be solved exactly and efficiently. One example is the top-$m$ arm identification problem in the bandit setting (cao2015top), where any efficient sorting algorithm suffices.
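As a concrete illustration of the remark above, the following minimal sketch shows an exact oracle (so the approximation ratio is 1) for a top-$m$ problem with a linear reward. The function names `top_m_oracle` and `linear_reward`, and the choice of a sum-of-means reward, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def top_m_oracle(mu, m):
    """Exact oracle for a linear reward: return the m arms with the largest means.

    Hypothetical helper: `mu` is a vector of (estimated) means, `m` the super-arm size.
    """
    mu = np.asarray(mu, dtype=float)
    # argsort is ascending, so sort the negated means; "stable" keeps ties in index order.
    order = np.argsort(-mu, kind="stable")
    return frozenset(int(k) for k in order[:m])

def linear_reward(mu, super_arm):
    """Assumed reward function: the sum of the means of the base arms in the super arm."""
    return float(sum(mu[k] for k in super_arm))

best = top_m_oracle([0.1, 0.9, 0.5, 0.7], m=2)
print(sorted(best), linear_reward([0.1, 0.9, 0.5, 0.7], best))
```

Because sorting solves this problem exactly, no approximation is involved; an approximate oracle would only be needed for harder combinatorial structures.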
As only an approximation oracle is used for optimization, it is reasonable to use expected approximation cumulative regret to measure the performance of the learning agent, defined as follows.
Definition 2.4 (Expected approximation cumulative regret).
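The agent’s policy is evaluated by its expected $\alpha$-approximation cumulative regret,
$$\mathcal{R}(T)=\mathbb{E}\left[\alpha\sum_{t=1}^{T}\max_{S\in\mathcal{F}} r_{\boldsymbol{\mu}_t}(S)-\sum_{t=1}^{T} r_{\boldsymbol{\mu}_t}(S_t)\right],$$
where the expectation is taken with respect to the selection of $S_t$.
2.2 Generalized Likelihood Ratio Change-Point Detector for Sub-Bernoulli Distribution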
Sequential changepoint detection is a classical problem in statistical sequential analysis, but most existing works make additional assumptions on the prechange and postchange distributions which might not hold in the bandit setting (siegmund2013sequential; basseville1993detection). In general, designing algorithms with provable guarantees for change detection with little assumption on prechange and postchange distributions is very challenging. In our algorithm design, we will use the generalized likelihood ratio (GLR) changepoint detector (besson2019generalized), which works for any subBernoulli distribution. Compared to other existing change detection methods used in piecewisestationary MAB, the GLR detector has less parameters to be tuned and needs less prior knowledge for the bandit instance. Specifically, GLR only needs to tune the threshold for the changepoint detection, and does not require the smallest change in expectation among all changepoints. On the contrary, CUSUM (liu2018change) and SW (cao2019nearly) both need more parameters to be tuned and need to know the smallest magnitude among all changepoints beforehand, which limits their practicality.
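To define the GLR change-point detector, we need some more definitions. A distribution $f$ is said to be sub-Bernoulli if $\mathbb{E}_{X\sim f}[e^{\lambda X}]\le e^{\phi_\mu(\lambda)}$ for all $\lambda\in\mathbb{R}$, where $\mu=\mathbb{E}_{X\sim f}[X]$ and $\phi_\mu(\lambda)=\log(1-\mu+\mu e^{\lambda})$ is the log moment generating function of a Bernoulli distribution with mean $\mu$. Notice that the support of each reward distribution $f_{k,t}$ is a subset of the interval $[0,1]$; thus all $f_{k,t}$ are sub-Bernoulli distributions with mean $\mu_{k,t}$, due to the following lemma.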
Lemma 2.5 (Lemma 1 in cappe2013kullback).
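Any distribution $f$ with bounded support within the interval $[0,1]$ is a sub-Bernoulli distribution that satisfies:
$$\mathbb{E}_{X\sim f}[e^{\lambda X}]\le e^{\phi_\mu(\lambda)},\quad\forall\lambda\in\mathbb{R}.$$
Suppose we have either a time sequence $\{X_t\}_{t=1}^{n}$ drawn from a single sub-Bernoulli distribution for all $t\le n$, or drawn from two sub-Bernoulli distributions separated by an unknown change-point $s\in[1,n-1]$. This problem can be formulated as a parametric sequential test:
$$\mathcal{H}_0:\ \exists f_0:\ X_1,\ldots,X_n\overset{\text{i.i.d.}}{\sim}f_0,\qquad \mathcal{H}_1:\ \exists f_0\neq f_1,\ s\in[1,n-1]:\ X_1,\ldots,X_s\overset{\text{i.i.d.}}{\sim}f_0\ \text{and}\ X_{s+1},\ldots,X_n\overset{\text{i.i.d.}}{\sim}f_1.$$
The GLR statistic for sub-Bernoulli distributions is:
$$\mathrm{GLR}(n)=\sup_{s\in[1,n-1]}\left[s\times\mathrm{kl}(\hat{\mu}_{1:s},\hat{\mu}_{1:n})+(n-s)\times\mathrm{kl}(\hat{\mu}_{s+1:n},\hat{\mu}_{1:n})\right],\tag{1}$$
where $\hat{\mu}_{s:s'}$ is the mean of the observations collected between $s$ and $s'$, and $\mathrm{kl}(x,y)$ is the binary relative entropy between Bernoulli distributions,
$$\mathrm{kl}(x,y)=x\log\frac{x}{y}+(1-x)\log\frac{1-x}{1-y}.$$
If the GLR in Eq. (1) is large, it indicates that hypothesis $\mathcal{H}_1$ is more likely. Now, we are ready to define the sub-Bernoulli GLR change-point detector with confidence level $\delta\in(0,1)$.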
Definition 2.6.
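The sub-Bernoulli GLR change-point detector with threshold function $\beta(n,\delta)$ is
$$\tau=\inf\left\{n\in\mathbb{N}:\ \sup_{s\in[1,n-1]}\left[s\times\mathrm{kl}(\hat{\mu}_{1:s},\hat{\mu}_{1:n})+(n-s)\times\mathrm{kl}(\hat{\mu}_{s+1:n},\hat{\mu}_{1:n})\right]\ge\beta(n,\delta)\right\},$$
where $\beta(n,\delta)=2\mathcal{G}\left(\log(3n\sqrt{n}/\delta)/2\right)+6\log(1+\log n)$, and $\mathcal{G}(\cdot)$ is as in Eq. (13) in kaufmann2018mixture.
The pseudo-code of the sub-Bernoulli GLR change-point detector is summarized in Algorithm 1 for completeness.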
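The detector described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's Algorithm 1: the function names are hypothetical, and the threshold `simple_threshold` uses the simplified form $\log(3n\sqrt{n}/\delta)$ as an assumption, in place of the exact threshold function built from kaufmann2018mixture.

```python
import math

def bernoulli_kl(x, y, eps=1e-15):
    """Binary relative entropy kl(x, y), with inputs clipped away from 0 and 1."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def glr_statistic(xs):
    """Maximize s*kl(mean[1:s], mean[1:n]) + (n-s)*kl(mean[s+1:n], mean[1:n]) over s."""
    n = len(xs)
    if n < 2:
        return 0.0
    total = sum(xs)
    mean_all = total / n
    best, prefix = 0.0, 0.0
    for s in range(1, n):
        prefix += xs[s - 1]               # running sum of the first s observations
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        stat = (s * bernoulli_kl(mean_left, mean_all)
                + (n - s) * bernoulli_kl(mean_right, mean_all))
        best = max(best, stat)
    return best

def simple_threshold(n, delta):
    """Assumed simplified threshold; the exact threshold in the text is more involved."""
    return math.log(3 * n * math.sqrt(n) / delta)

def glr_detect(xs, delta=0.05, threshold=simple_threshold):
    """Return True if the GLR statistic of xs reaches the threshold."""
    return glr_statistic(xs) >= threshold(len(xs), delta)

print(glr_detect([0.0] * 100 + [1.0] * 100))  # an abrupt, large change
```

Each call scans all split points, so the cost is linear in the number of observations; checking the detector only every few steps is a common way to reduce this cost in practice.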
3 The GLR-CUCB Algorithm
Our proposed algorithm, GLR-CUCB, combines an efficient combinatorial semi-bandit algorithm, CUCB (chen2013combinatorial), with a change-point detector running on each base arm (see Algorithm 2). GLR-CUCB requires the number of time steps $T$, the number of base arms $K$, the uniform exploration probability $p$, and the confidence level $\delta$ as inputs. Let $\tau$ denote the last change-point detection time, and $n_k$ denote the number of observations of base arm $k$ after $\tau$; both are initialized to zero at the beginning of the algorithm.
At each time step, GLR-CUCB first determines whether it will enter forced uniform exploration according to the condition in line 3. If it is in forced exploration, a random super arm that contains the base arm selected for exploration (line 4) is played, to ensure that a sufficient number of samples is collected for every base arm for change-point detection. Otherwise, the next super arm to be played is determined by the $\alpha$-approximation oracle (line 6) given the UCB indices (line 19). Then, the learning agent plays the super arm $S_t$, and receives the reward of the super arm and the rewards of the base arms that are contained in the super arm (line 8). In the next step, the algorithm updates the statistics for each base arm (lines 10-11) in order to run the GLR change-point detector (Algorithm 1) with confidence level $\delta$ (line 12). If the GLR change-point detector detects a change in distribution for any of the base arms, the algorithm sets $\tau$ to be the current time step and all $n_k$ to be 0 (lines 13-14) before going into time step $t+1$. Lastly, the UCB indices of all base arms are updated (line 19).
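The loop described above can be sketched as follows for a top-$m$ problem with Bernoulli rewards. This is a hedged sketch, not a transcription of Algorithm 2: the simulator input `means_by_time`, the forced-exploration rule `(t - tau) mod floor(K/p) < K`, the UCB radius `sqrt(3 log(t - tau) / (2 n_k))`, the simplified detection threshold, and the exact top-$m$ oracle are all assumptions chosen for illustration.

```python
import math
import random

def kl(x, y, eps=1e-15):
    """Binary relative entropy, clipped away from 0 and 1."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def glr_detect(xs, delta):
    """GLR test with an assumed simplified threshold log(3 n sqrt(n) / delta)."""
    n = len(xs)
    if n < 2:
        return False
    total, prefix = sum(xs), 0.0
    threshold = math.log(3 * n * math.sqrt(n) / delta)
    for s in range(1, n):
        prefix += xs[s - 1]
        stat = (s * kl(prefix / s, total / n)
                + (n - s) * kl((total - prefix) / (n - s), total / n))
        if stat >= threshold:
            return True
    return False

def glr_cucb(means_by_time, m, p=0.05, delta=0.05, seed=0):
    """Simulate the sketch; means_by_time[t][k] is the Bernoulli mean of arm k at step t+1."""
    rng = random.Random(seed)
    T, K = len(means_by_time), len(means_by_time[0])
    tau = 0                                   # last detection time
    history = [[] for _ in range(K)]          # rewards of each arm observed since tau
    period = math.floor(K / p) if p > 0 else None
    restarts, total_reward = [], 0.0
    for t in range(1, T + 1):
        slot = (t - tau) % period if period else K
        if slot < K:
            # Forced exploration: play a random super arm containing arm `slot`.
            others = [k for k in range(K) if k != slot]
            S = [slot] + rng.sample(others, m - 1)
        else:
            # UCB indices; unobserved arms get an infinite index.
            ucb = []
            for k in range(K):
                n_k = len(history[k])
                if n_k == 0:
                    ucb.append(float("inf"))
                else:
                    radius = math.sqrt(3 * math.log(t - tau) / (2 * n_k))
                    ucb.append(sum(history[k]) / n_k + radius)
            S = sorted(range(K), key=lambda k: -ucb[k])[:m]   # exact top-m oracle
        changed = False
        for k in S:                           # semi-bandit feedback for each played arm
            x = 1.0 if rng.random() < means_by_time[t - 1][k] else 0.0
            total_reward += x
            history[k].append(x)
            changed = glr_detect(history[k], delta) or changed
        if changed:                           # global restart of all statistics
            tau = t
            history = [[] for _ in range(K)]
            restarts.append(t)
    return total_reward, restarts

means = [[0.9, 0.9, 0.1, 0.1]] * 300 + [[0.1, 0.1, 0.9, 0.9]] * 300
print(glr_cucb(means, m=2)[1])  # detection times (depends on the random seed)
```

The sketch stores the raw reward history per arm so that the detector can rescan it; the global restart simply discards every history, which matches the restart behaviour described in the paragraph above.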
Remark 3.1.
The uniform exploration is necessary for this algorithm, and similar strategy has been adopted in liu2018change; cao2019nearly. Intuitively, uniform exploration ensures each base arm gathers sufficient samples to guarantee quick change detection whereas pure UCB exploration is incapable of this. One more rigorous argument is given in garivier2011upper, which shows that theoretically pure UCB exploration performs badly on piecewisestationary MAB.
Remark 3.2.
Thompson sampling (TS) often performs better than UCB policy in empirical simulations, but it has been shown that one cannot incorporate an approximate oracle in TS for even MAB problems (wang2018thompson). Thus our algorithm adopts UCB policy for the bandit component to ensure compatibility with approximation oracle.
4 Regret Upper Bound
In this section, we analyze the step regret of our proposed algorithm GLRCUCB. Recall is the time horizon, is the number of piecewisestationary segments, are the changepoints, and for each segment , is the vector encoding the expected rewards of all base arms. A super arm is bad with respect to the th piecewisestationary distributions if . We define to be the set of bad super arms with respect to the th piecewisestationary segment. We define the suboptimality gap in the th stationary segment as follows:
Furthermore, let and be the maximum and minimum suboptimal gaps for the whole time horizon, respectively. Lastly, denote the largest gap at changepoint as , , and . We need the following assumption for our theoretical analysis.
Assumption 4.1.
Define and assume , , where .
Tuning and properly (See Corollary 4.3), and applying the upper bound on by kaufmann2018mixture with ,
the length of each piecewisestationary segment is . Roughly, we assume the length of each stationary segment to be sufficiently long, in order to let the GLR changepoint detector detect the change in distribution within a reasonable delay with high probability. Similar assumption on the length of stationary segments also appears in other literature on piecewise stationary MAB (liu2018change; cao2019nearly; besson2019generalized). Note that Assumption 4.1 is only required for the theoretical analysis; Algorithm 2 can be implemented regardless of this assumption. Now we are ready to state the regret upper bound for Algorithm 2.
Theorem 4.2.
Theorem 4.2 indicates that the regret comes from four sources. Terms (a) and (b) correspond to the cost of exploration, while terms (c) and (d) correspond to the cost of changepoint detection. More specifically, term (a) is due to UCB exploration, term (b) is due to uniform exploration, term (c) is due to expected delay of GLR changepoint detector, and term (d) is due to the false alarm probability of GLR changepoint detector. We need to carefully tune the exploration probability and false alarm probability to balance the tradeoff.
The following corollary comes directly from Theorem 4.2 by properly tuning the parameters in the algorithm.
Corollary 4.3.
Let , we have

( is known) Choosing , , gives ;

( is unknown) Choosing , , gives .
Remark 4.1.
The effect of oracle is to reduce the dependency on number of base arms from to during exploration (first term in the regret appeared in Corollary 4.3). In the worst case, can be exponential with respect to . Recall that if we use standard MAB algorithms for exploration, the dependency on is .
Remark 4.2.
As becomes larger, the regret is dominated by the cost of changepoint detection, which has similar order compared to the regret bound of piecewisestationary MAB algorithms. This is reasonable since our setting assumes that we have access to the reward of the base arms contained in the super arm played by the agent.
Remark 4.3.
When is large, the order of the regret bound is similar to the regret bound of adversarial bandit. But note that regret definition for adversarial bandit and piecewisestationary bandit is different. For the first case, the regret is evaluated with respect to one fixed arm which is optimal for the whole horizon. But for the second case, the regret is evaluated with respect to pointwise optimal arm, which is much more challenging.
We can use Corollary 4.3 as a guide for parameter tuning. The above corollary indicates that without knowledge of the number of changepoints , we pay a penalty of a factor of in the long run.
For the detailed proof of Theorem 4.2, see Appendix A. Here we sketch the proof; to do so we need some additional lemmas. We start by proving the regret of GLRCUCB under the stationary scenario.
Lemma 4.4.
Under the stationary scenario, i.e. , the approximation cumulative regret of GLRCUCB is upper bounded as:
The first term is due to the possible false alarm of the changepoint detection subroutine, the second term is due to the uniform exploration, and the last term is due to the UCB exploration. We upper bound the false alarm probability in Lemma 4.5, as follows.
Lemma 4.5 (False alarm probability in the stationary scenario).
Consider the stationary scenario, i.e. , with confidence level ; we have that
Remark 4.4.
By setting , we will have . Asymptotically, the false alarm probability will go to 0.
In the next lemma, we show the GLR changepoint detector is able to detect change in distribution reasonably well with high probability, given all previous changepoints were detected reasonably well. The formal statement is as follows.
Lemma 4.6.
(Lemma 12 in besson2019generalized) Define the event that all the changepoints up to th one have been detected successfully within a small delay:
(2) 
Then, , and , where is the detection time of the th changepoint.
Lemma 4.6 provides an upper bound for the conditional expected detection delay, given the good events .
Corollary 4.7 (Bounded conditional expected delay).
.
Given these lemmas, we can derive the regret upper bound for GLRCUCB in a recursive manner. Specifically, we prove Theorem 4.2 by recursively decomposing the regret into a collection of good events and bad events. The good events contain all sample paths that GLRCUCB reinitialize the UCB index of base arms after all changepoints correctly within a small delay. On the other hand, the bad events contain all sample paths where either GLR change detector fails to detect the change in distribution or detects the change with a large delay. The cost incurred given the good events can be upper bounded by Lemma 4.4 and Lemma 4.7. By upper bounding the probabilities of bad events via Lemma 4.5 and Lemma 4.6, the cost incurred given the bad events is analyzable. Detailed proofs are presented in Appendix A.
5 Regret Lower Bound
The lower bound for MAB problems has been studied extensively. Previously, the best available minimax lower bound for piecewise-stationary MAB was $\Omega(\sqrt{T})$ by garivier2011upper. Note that piecewise-stationary MAB is a special instance of piecewise-stationary CMAB in which every super arm is a base arm; thus this lower bound still holds for piecewise-stationary CMAB. We derive a tighter lower bound by characterizing the dependency on $N$ and $K$.
Theorem 5.1.
If and , then the worstcase regret for any policy is lower bounded by
where , .
Proof.
(Sketch) The high level idea is to construct a randomized ‘hard’ instance which is appropriate to our setting (bubeck2012regret; besbes2014stochastic; lattimore2018bandit), then analyze its regret lower bound which holds for any exploration policy. The construction of this ‘hard’ instance is as follows.
We partition the time horizon into $N$ segments of equal length, except possibly the last segment. Within each segment, the rewards of all arms follow Bernoulli distributions that stay unchanged. At each time step there is one optimal arm with expected reward $\frac{1}{2}+\epsilon$, and the remaining arms share the same expected reward $\frac{1}{2}$. Between two consecutive segments, the optimal arm changes to an arm sampled uniformly at random from the remaining arms.
We then use Lemma A.1 in auer2002nonstochastic to upper bound the expected number of pulls to any arm being optimal under change of distributions, from the suboptimal reward distribution to the optimal reward distribution (Bern( to Bern()). Given the upper bound of expected number of pulls to the optimal arm, we can lower bound the regret for any exploration policy. By properly tuning and after some additional steps, we can derive the minimax regret lower bound. The condition comes from the fact that the lower bound needs to be nontrivial, and comes from the tuning of .
For the detailed proof, please refer to Appendix B. ∎
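To make the construction in the proof sketch concrete, the following sketch generates one random draw of such a hard instance. The function name `hard_instance`, the segment length `ceil(T / N)`, and the specific means `0.5` and `0.5 + eps` are assumptions for illustration; the precise constants are given in Appendix B.

```python
import math
import random

def hard_instance(T, K, N, eps, seed=0):
    """Return (means, best_arms): per-step Bernoulli means and the optimal arm per segment."""
    rng = random.Random(seed)
    seg_len = math.ceil(T / N)            # equal-length segments; the last may be shorter
    means, best_arms = [], []
    best = rng.randrange(K)               # optimal arm of the first segment
    for t in range(T):
        if t > 0 and t % seg_len == 0:
            # New segment: resample the optimal arm uniformly from the other arms.
            best = rng.choice([k for k in range(K) if k != best])
        if t % seg_len == 0:
            best_arms.append(best)
        row = [0.5] * K
        row[best] = 0.5 + eps             # only the optimal arm has the larger mean
        means.append(row)
    return means, best_arms

means, best_arms = hard_instance(T=1000, K=5, N=4, eps=0.1)
print(best_arms)
```

Because the new optimal arm is always drawn from the other arms, every segment boundary is a genuine change-point, and a policy cannot know in advance which arm becomes optimal.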
The conditions for this minimax lower bound are mild, since in practice the number of base arms is usually much larger and we care about the longterm regret, in other words, large regime.
Our minimax lower bound shows that GLRCUCB is nearly orderoptimal with respect to all parameters. On the other hand, as a byproduct, this bound also indicates that EXP3S (auer2002nonstochastic) and MUCB (cao2019nearly) are nearly orderoptimal for piecewisestationary MAB, up to polylogarithm factors. To be more specific, EXP3S and MUCB achieve regret and respectively.
6 Experiments
We compare GLR-CUCB with five baselines from the literature, one variant of GLR-CUCB, and one oracle algorithm. Specifically, DUCB (kocsis2006discounted) and MUCB (cao2019nearly) are selected from the piecewise-stationary MAB literature; CUCB (chen2013combinatorial), CTS (wang2018thompson), and Hybrid (zimmert2019beating) are selected from the stochastic combinatorial semi-bandit literature. The variant of GLR-CUCB, termed LR-GLR-CUCB, uses a different restart strategy. Instead of restarting the estimation of all base arms once a change-point is detected, LR-GLR-CUCB uses a local restart strategy (it only restarts the estimation of the base arms that are detected to have changes in reward distributions). For the oracle algorithm, termed Oracle-CUCB, we assume the algorithm knows when the optimal super arm changes and restarts CUCB at these change-points. Note that this is stronger than knowing the change-points, since a change in distribution does not imply a change in the optimal super arm. Experiments are conducted on both synthetic and real-world datasets for the $m$-set bandit problem, which aims to identify the $m$ arms with highest expected reward at each time step. Equivalently, the reward function is the summation of the expected rewards of $m$ base arms. Since DUCB and MUCB are originally designed for piecewise-stationary MAB, to adapt them to the piecewise-stationary CMAB setting, we treat every super arm as a single arm when we run these two algorithms. Plots of the reward distributions of base arms over time are deferred to Appendix C.1. The details of parameter tuning for all of these algorithms in the different experiments are included in Appendix C.2.
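The adaptation of MAB baselines described above, treating every super arm as a single arm, can be sketched as an enumeration of all size-$m$ subsets. The helper name `enumerate_super_arms` and the values of `K` and `m` below are hypothetical and chosen only to show how quickly the action space grows.

```python
from itertools import combinations
from math import comb

def enumerate_super_arms(K, m):
    """List every size-m subset of the K base arms; each subset becomes one MAB arm."""
    return [frozenset(c) for c in combinations(range(K), m)]

arms = enumerate_super_arms(K=6, m=2)
print(len(arms), comb(6, 2))          # number of MAB arms the baseline must explore
print(comb(20, 5))                    # the count grows quickly with K and m
```

A baseline run this way keeps one statistic per subset and ignores the semi-bandit feedback on individual base arms, which is one plausible reason such baselines need far more exploration than combinatorial algorithms.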
6.1 Synthetic Dataset
In this case we design a synthetic piecewisestationary combinatorial semibandit instance as follows:

Each base arm follows Bernoulli distribution.

Only one base arm changes its distribution between two consecutive piecewisestationary segments.

Every piecewisestationary segment is of equal length.
We let , , , and . The average regret of all algorithms are summarized in Figure 1.
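An instance with the three properties listed above can be generated along the following lines. This is only a sketch: the function name `synthetic_instance`, the range of the means, and the parameter values in the usage line are hypothetical and are not the values used in the experiment.

```python
import random

def synthetic_instance(T, K, N, seed=0):
    """Return (segments, means_by_time) for a piecewise-stationary Bernoulli instance.

    segments[i][k] is the mean of base arm k on segment i; exactly one arm changes
    between consecutive segments, and all N segments have the same length T // N.
    """
    if T % N != 0:
        raise ValueError("T must be divisible by N for equal-length segments")
    rng = random.Random(seed)
    seg_len = T // N
    mu = [rng.uniform(0.1, 0.9) for _ in range(K)]
    segments = [list(mu)]
    for _ in range(N - 1):
        k = rng.randrange(K)              # the single arm whose distribution changes
        old = mu[k]
        while mu[k] == old:               # redraw until the mean actually differs
            mu[k] = rng.uniform(0.1, 0.9)
        segments.append(list(mu))
    means_by_time = [segments[t // seg_len] for t in range(T)]
    return segments, means_by_time

segments, means_by_time = synthetic_instance(T=5000, K=6, N=5)
print(len(means_by_time), [round(x, 2) for x in segments[0]])
```

Bernoulli rewards can then be sampled at each step from `means_by_time[t]`, which is the input format assumed by a simulator that replays the instance step by step.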
Note that the optimal super arm does not change for the last three piecewisestationary segments. Observe that the welltuned GLR changepoint detector is insensitive to change with small magnitude, which implicitly avoids unnecessary and costly global restart, since small change is less likely to affect the optimal super arm. Surprisingly, GLRCUCB and LRGLRCUCB perform nearly as well as OracleCUCB and significantly better than other algorithms in regrets. In general, algorithms designed for stochastic CMAB outperform algorithms designed for piecewisestationary MAB. The reason is when the horizon is small, the dimension of the action space dominates the regret, and this effect becomes more obvious when is larger. Although orderwise, the cost incurred by the changepoint detection is much higher than the cost incurred by exploration.
Note that our experiment on this synthetic dataset does not satisfy Assumption 4.1. For example, the gap between the first segment and second segment is , and we choose and for GLRCUCB, which means the length of the second segment should be at least 9874. However, the actual length of the second segment is only 1000. Thus our algorithm performs very well compared to other algorithms even if Assumption 4.1 is violated. If Assumption 4.1 is satisfied, GLRCUCB can only perform better since it is easier to detect the change in distribution.
6.2 Yahoo! Dataset
We adopt the benchmark dataset for the real-world evaluation of bandit algorithms from Yahoo! (Yahoo! Front Page Today Module User Click Log Dataset, available at https://webscope.sandbox.yahoo.com). This dataset contains the user click log for news articles displayed in the Featured Tab of the Today Module (li2011unbiased). Every base arm corresponds to the click rate of one article. Upon arrival of a user, our goal is to maximize the expected number of clicked articles by presenting $m$ out of $K$ articles to the user.
Yahoo! Experiment 1 (, , ).
We preprocess the dataset following cao2019nearly. To make the experiment nontrivial, we modify the dataset by: 1) the click rate of each base arm is enlarged by times; 2) Reducing the time horizon to . Results are in Figure 2.
Yahoo! experiment 1 is much harder than the synthetic problem, since it is much more nonstationary. Our experiments show GLRCUCB still significantly outperforms other algorithms and only has a small gap with respect to OracleCUCB. Again, Assumption 4.1 does not hold for these two instances, thus we believe it is fair to compare GLRCUCB with other algorithms. Unexpectedly, LRGLRCUCB performs even better than oracleCUCB, which suggests there is still much to exploit in the piecewisestationary bandits, since global restart has inferior performance in some cases, especially when the change in distribution is not significant. Additional experiments on Yahoo! dataset can be found in Appendix C.1.
7 Conclusion and Future Work
We have developed the first efficient and general algorithm for piecewise-stationary CMAB, termed GLR-CUCB, which extends CUCB (chen2013combinatorial) by incorporating a GLR change-point detector. We show that the regret of GLR-CUCB is upper bounded on the order of $\mathcal{O}(\sqrt{NKT\log T})$, and prove a minimax lower bound for piecewise-stationary MAB and CMAB on the order of $\Omega(\sqrt{NKT})$, which shows our algorithm is nearly order-optimal within poly-logarithmic factors. Experimental results show our proposed algorithm outperforms other state-of-the-art algorithms.
Future work includes designing algorithms for piecewise-stationary CMAB with a better restart strategy. GLR-CUCB restarts whenever the GLR change-point detector declares that the reward distribution of one base arm has changed, but such a restart is often unnecessary, because a change-point with small magnitude might not change the optimal super arm. Another very challenging open problem is whether one can close the gap between the regret upper bound and the minimax regret lower bound, that is, whether one can develop an algorithm which is order-optimal for piecewise-stationary CMAB.
References
Appendices
Appendix A Detailed Proofs of Theorem 4.2
A.1 Proof of Lemma 4.4
Proof.
Define as a counter for each base arm at time (note that it is different from the counter defined in the algorithm) and update the counters in each round as follows: (1) After the initialization rounds, set . (2) For a round , if is bad, then increase by one, where .
By definition, the total number of bad rounds at time is no more than . Throughout the proof, we will use to denote the total number of times arm is played by the agent till time , to emphasize the time dependency in order to make the proof more readable.
Note that when a bad super arm is played, it incurs loss at most . Thus,
Thus it suffices to upper bound to upper bound the cumulative regret. Let , we have
The next step is to show . Let be the number of times arm is played in the first rounds, be the empirical mean of samples of th arm at the first piecewisestationary segment, be the UCB index of th arm at time , and be the actual mean of th arm during the first piecewisestationary segment. For any , by applying the Hoeffding’s inequality, we have
Define the event . By the union bound we have . However, we can show that . In other words, these two events are mutually exclusive. The reason is if both of these events hold, we have
where is the mean vector of the base arms at the first piecewisestationary segment. As for these above inequalities, (a) holds since ; (b) holds by the Lipschitz property of and . Note that the upper bound of the norm of vector difference comes from the fact that , and ; (c) holds by the definition of approximation oracle; (d) holds by the monotone property of and . However this contradicts the fact that , since , which implies that these two events are mutually exclusive. Thus,
To sum up,
The proof is done. ∎
A.2 Proof of Lemma 4.5
Proof.
Define as the first changepoint detection time of the th base arm, and then as GLR_CUCB restarts the whole algorithm if changepoint is detected on any of the base arms. Applying the union bound to the false alarm probability , we have that
Recall the GLR statistic defined in Eq. (1), and substitute it into ,
where is the mean of the rewards generated from the distribution with expected reward from time step to as only stationary scenario is considered here. Here, inequality (a) is because of the fact that ; inequality (b) holds since:
inequality (c) is because of the union bound; inequality (d) is according to the Lemma 10 in besson2019generalized; inequality (e) holds due to the Riemann zeta function , when , . Summing over , we conclude that . ∎
A.3 Proof of Corollary 4.7
Proof.
By the definition of , it follows directly that the conditional expected detection delay is upper bounded by . ∎
A.4 Proof of Theorem 4.2
Proof.
Define the good event and good event , . Recall the definition of the good event that all the changepoints up to th one have been detected successfully and efficiently in Eq. (2), and we can find that is the intersection of the event sequence of and up to the th changepoint. By first decomposing the expected approximation cumulative regret with respect to the event , we have that