A Near-Optimal Change-Detection Based Algorithm for Piecewise-Stationary Combinatorial Semi-Bandits
We investigate the piecewise-stationary combinatorial semi-bandit problem. Compared to the original combinatorial semi-bandit problem, our setting assumes the reward distributions of base arms may change in a piecewise-stationary manner at unknown time steps. We propose an algorithm, GLR-CUCB, which combines an efficient combinatorial semi-bandit algorithm, CUCB, with an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT). Our analysis shows that the regret of GLR-CUCB is upper bounded by $\mathcal{O}(\sqrt{NKT\log T})$, where $N$ is the number of piecewise-stationary segments, $K$ is the number of base arms, and $T$ is the number of time steps. As a complement, we also derive a nearly matching regret lower bound on the order of $\Omega(\sqrt{NKT})$, for both piecewise-stationary multi-armed bandits and combinatorial semi-bandits, using information-theoretic techniques and judiciously constructed piecewise-stationary bandit instances. Our lower bound is tighter than the best available regret lower bound, which is $\Omega(\sqrt{T})$. Numerical experiments on both synthetic and real-world datasets demonstrate the superiority of GLR-CUCB compared to other state-of-the-art algorithms.
The multi-armed bandit (MAB) problem, first proposed by thompson1933likelihood, has been studied extensively in the statistics and machine learning communities, as it models many online decision making problems such as online recommendation (li2016collaborative), computational advertising (tang2013automatic), and crowdsourcing task allocation (hassan2014multi). The classical MAB is modeled as an agent repeatedly pulling one of $K$ arms and observing the reward generated, with the goal of minimizing the regret, which is the difference between the reward of the optimal arm in hindsight and the reward of the arm chosen by the agent. This classical problem is well understood for both stochastic (lai1985asymptotically) and adversarial settings (auer2002nonstochastic). In the stochastic setting the reward of each arm is generated from a fixed distribution, and it is well known that the problem-dependent lower bound is of order $\Omega(\log T)$ (lai1985asymptotically), where $T$ is the number of time steps. Several algorithms have been proposed and proven to achieve $\mathcal{O}(\log T)$ regret (agrawal2012analysis; auer2002finite). In the adversarial setting the environment generates the reward at each time step in an adversarial manner, and the minimax regret lower bound is of order $\sqrt{T}$. Several algorithms achieving order-optimal (up to poly-logarithmic factors) regret have been proposed in recent years (hazan2011better; bubeck2018sparsity; li2019bandit).
Many real-world applications, however, have a combinatorial nature that cannot be fully characterized by the classical MAB model. For example, online movie sites aim to recommend multiple movies to the users to maximize their utility under some constraints (e.g. recommend at most one movie for each category). This phenomenon motivates the study of combinatorial semi-bandits (CMAB), which aims to identify the best superarm, a set of base arms with highest aggregated reward. Several algorithms for stochastic CMAB with provable guarantees based on optimism principle have been proposed recently (chen2013combinatorial; kveton2015tight; combes2015combinatorial). All of these algorithms use some oracle to overcome the curse of dimension of the action space for solving some combinatorial optimization problem at each iteration. In addition, these algorithms achieve the optimal regret upper bound which is , where is an instance-dependent parameter. Adversarial CMAB is also well studied. Several algorithms achieve optimal regret of order based on either Follow-the-Regularized-Leader or Follow-the-Perturbed-Leader, both of which are general frameworks for adversarial online learning algorithm design (hazan2016introduction). Moreover, one recent study has developed an algorithm that is order-optimal for both stochastic and adversarial CMAB (zimmert2019beating).
Although both stochastic and adversarial CMAB are well-studied, understanding of the scenario lying in the “middle” of these two settings is still limited. Such a “middle” setting, where the reward distributions of base arms slowly change over time, may be a more realistic model in many applications. For instance, in online recommendation systems, users’ preferences are unlikely to be either time-invariant or to change significantly and frequently over time. Thus in this case it would be too idealistic to assume the stochastic CMAB model and too conservative to assume the adversarial CMAB model. Similar situations appear in web search, online advertisement, and crowdsourcing (yu2009piecewise; pereira2018analyzing; vempaty2014reliable). As such, we investigate a setting lying between these two standard CMAB models, namely the piecewise-stationary combinatorial semi-bandit, which we will define formally in Section 2. Piecewise-stationary CMAB is a natural generalization of the piecewise-stationary MAB model (hartland2007change; kocsis2006discounted; garivier2011upper), and can be interpreted as an approximation to the slow-varying CMAB problem. Roughly, compared to the stochastic CMAB, we assume reward distributions of base arms remain fixed for certain time periods called piecewise-stationary segments, but can change abruptly at some unknown time steps, called change-points.
Previous works on piecewise-stationary MAB may be divided into two categories: passively adaptive approach (garivier2011upper; besbes2014stochastic; wei2018abruptly) and actively adaptive approach (cao2019nearly; liu2018change; besson2019generalized; auer2019adaptively). Passively adaptive approaches make decisions based on the most recent observations and are unaware of the underlying distribution changes. On the contrary, actively adaptive approaches incorporate a change-point detector subroutine to monitor the reward distributions, and restart the algorithm once a change-point is detected. Numerous empirical experiments have shown that actively adaptive approaches outperform passively adaptive approaches (mellor2013thompson), which motivates us to adopt an actively adaptive approach.
Our main contributions include the following:
We propose a simple and general algorithm for piecewise-stationary CMAB, named GLR-CUCB, which is based on CUCB (chen2013combinatorial) with a novel change-point detector, the generalized likelihood ratio test (GLRT) (besson2019generalized). The advantage of GLRT change-point detection is that it is almost parameter-free and thus easier to tune than previously proposed change-point detection methods used in nonstationary MAB, such as CUSUM (liu2018change) and SW (cao2019nearly).
For any combinatorial action set, we derive the problem-dependent regret bound for GLR-CUCB under mild conditions (see Section 4). When the number of change-points $N$ is known beforehand, the regret of GLR-CUCB is upper bounded by $\mathcal{O}(C_1 N K^2 \log T + C_2\sqrt{NKT\log T})$ (nearly order-optimal within a poly-logarithmic factor in $T$), where $K$ is the number of base arms. When $N$ is unknown, the algorithm achieves $\mathcal{O}(C_1 N K^2 \log T + C_2 N\sqrt{KT\log T})$. Here, $C_1$ and $C_2$ are problem-dependent constants which do not depend on $T$, $N$, or $K$.
We derive a tighter minimax lower bound for both piecewise-stationary MAB and piecewise-stationary CMAB on the order of $\Omega(\sqrt{NKT})$. Since piecewise-stationary MAB is a special instance of piecewise-stationary CMAB in which every super arm is a single arm, any minimax lower bound that holds for piecewise-stationary MAB also holds for piecewise-stationary CMAB. To the best of our knowledge, this is the best existing minimax lower bound for piecewise-stationary CMAB. Previously, the best available lower bound was $\Omega(\sqrt{T})$ (garivier2011upper), which does not depend on $N$ or $K$.
We demonstrate that GLR-CUCB performs significantly better than state-of-the-art algorithms through experiments on both synthetic and real-world datasets.
The remainder of this paper is organized as follows: the formal problem formulation and some preliminaries are introduced in Section 2, followed by the proposed GLR-CUCB algorithm in Section 3. We derive the upper bound on the regret of our algorithm in Section 4, and the minimax regret lower bound in Section 5. Section 6 gives our experimental results. Finally, we conclude the paper in Section 7. Due to the page limitation, we postpone proofs and additional experimental results to the appendix.
2 Problem Formulation and Background
In this section, we start with the formal definition of piecewise-stationary combinatorial semi-bandit as well as some technical assumptions in Section 2.1. Then, we introduce the GLR change-point detector used in our algorithm design, and its advantage in Section 2.2.
2.1 Piecewise-Stationary Combinatorial Semi-Bandits
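A piecewise-stationary combinatorial semi-bandit is characterized by a tuple $(\mathcal{K}, \mathcal{F}, \mathcal{T}, \{f_{k,t}\}_{k\in\mathcal{K},\,t\in\mathcal{T}}, r_{\boldsymbol{\mu}_t}(S_t))$. Here, $\mathcal{K}=\{1,\ldots,K\}$ is the set of $K$ base arms; $\mathcal{F}\subseteq 2^{\mathcal{K}}$ is the set of all super arms; $\mathcal{T}=\{1,\ldots,T\}$ is a sequence of $T$ time steps; $f_{k,t}$ is the reward distribution of arm $k$ at time $t$ with mean $\mu_{k,t}$ and bounded support within $[0,1]$; $r_{\boldsymbol{\mu}_t}(S_t)$ is the expected reward function defined on the super arm $S_t$ and the mean vector of all base arms $\boldsymbol{\mu}_t=[\mu_{1,t},\ldots,\mu_{K,t}]^\top$ at time $t$. Like chen2013combinatorial, we assume the expected reward function satisfies the following two properties: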
Assumption 2.1 (Monotonicity).
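Given two arbitrary mean vectors $\boldsymbol{\mu}$ and $\boldsymbol{\mu}'$, if $\mu_k \ge \mu'_k$ for all $k\in\{1,\ldots,K\}$, then $r_{\boldsymbol{\mu}}(S) \ge r_{\boldsymbol{\mu}'}(S)$ for every super arm $S$.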
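Assumption 2.2 ($L$-Lipschitz).
Given two arbitrary mean vectors $\boldsymbol{\mu}$ and $\boldsymbol{\mu}'$, there exists an $L<\infty$ such that $|r_{\boldsymbol{\mu}}(S) - r_{\boldsymbol{\mu}'}(S)| \le L\,\|P_S(\boldsymbol{\mu}-\boldsymbol{\mu}')\|_2$ for every super arm $S$, where $P_S(\cdot)$ is the projection operator specified as $P_S(\boldsymbol{\mu}) = [\mu_1\mathbb{1}\{1\in S\},\ldots,\mu_K\mathbb{1}\{K\in S\}]^\top$ in terms of the indicator function $\mathbb{1}\{\cdot\}$.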
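In the piecewise i.i.d. model, we define $N$, the number of piecewise-stationary segments in the reward process, to be
$$N = 1 + \sum_{t=1}^{T-1} \mathbb{1}\{\exists\, k\in\{1,\ldots,K\} : f_{k,t}\neq f_{k,t+1}\},$$
where $f_{k,t}$ is the reward distribution of base arm $k$ at time $t$.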
We denote these $N-1$ change-points as $\nu_1,\ldots,\nu_{N-1}$ respectively, and we let $\nu_0=0$ and $\nu_N=T$. For each piecewise-stationary segment $t\in[\nu_{i-1}+1,\nu_i]$, we use $f_k^i$ and $\mu_k^i$ to denote the reward distribution and the expected reward of arm $k$ on the $i$th piecewise-stationary segment, respectively. The vector encoding the expected rewards of all base arms at the $i$th segment is denoted as $\boldsymbol{\mu}^i=[\mu_1^i,\ldots,\mu_K^i]^\top$, $i=1,\ldots,N$. Note that when a change-point occurs, the reward distribution of at least one arm must have changed; however, the reward distributions of all base arms do not necessarily change.
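As an illustration of the segment count, the following Python sketch counts change-points in a sequence of mean vectors. It assumes each reward distribution is identified by its mean (as for Bernoulli arms), so comparing means stands in for comparing distributions. The function name `count_segments` and the toy numbers are hypothetical.

```python
import numpy as np

def count_segments(means):
    """Return (N, change_points) for a (T, K) array of per-step mean vectors.

    A change-point is a time step t (1-indexed) such that at least one base
    arm has a different mean at step t + 1 than at step t.
    Assumption: distributions are identified by their means.
    """
    means = np.asarray(means, dtype=float)
    # changed[t - 1] is True when any arm differs between steps t and t + 1.
    changed = np.any(means[1:] != means[:-1], axis=1)
    change_points = (np.flatnonzero(changed) + 1).tolist()
    return 1 + len(change_points), change_points

# Toy instance (hypothetical numbers): K = 2 arms, T = 6 steps.
# Arm 0 changes after step 2, arm 1 changes after step 4.
mu = np.array([[0.2, 0.5]] * 2 + [[0.8, 0.5]] * 2 + [[0.8, 0.1]] * 2)
N, nu = count_segments(mu)
print(N, nu)
```

Only one arm changes at each change-point in this toy instance, which matches the remark that not all base arms need to change simultaneously.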
For a piecewise-stationary combinatorial semi-bandit problem, at each time step $t$, the learning agent chooses a super arm $S_t$ to play based on the rewards observed up to time $t$. When the agent plays a super arm $S_t$, the rewards of the base arms contained in $S_t$ are revealed to the agent, as is the reward of the super arm itself. We assume that the agent has access to an $\alpha$-approximation oracle to carry out combinatorial optimization, defined as follows.
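Assumption 2.3 ($\alpha$-approximation oracle).
Given a mean vector $\boldsymbol{\mu}$, the $\alpha$-approximation oracle outputs an $\alpha$-suboptimal super arm $S$ such that $r_{\boldsymbol{\mu}}(S) \ge \alpha \max_{S'} r_{\boldsymbol{\mu}}(S')$, where the maximum is over all super arms.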
The approximation oracle assumption was first proposed in chen2013combinatorial for the combinatorial semi-bandit setting. This assumption is reasonable since many combinatorial NP-hard problems admit approximation algorithms, which can be solved efficiently in polynomial time (ausiello1995approximate). There are also many combinatorial problems which are not NP hard and can be solved efficiently. One example is the top- arm identification problem in the bandit setting (cao2015top), where any efficient sorting algorithm suffices.
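For the sorting-based case mentioned above, an exact oracle (approximation factor 1) can be sketched in a few lines of Python. The names `top_m_oracle` and `linear_reward`, the set size `m`, and the choice of a linear reward (the sum of the means of the selected base arms) are assumptions made for illustration.

```python
import numpy as np

def top_m_oracle(mu, m):
    """Exact oracle for selecting m base arms under a linear reward.

    Returns the (sorted) indices of the m largest entries of mu.
    """
    mu = np.asarray(mu, dtype=float)
    # argsort is ascending, so the last m indices are the m largest means.
    return sorted(np.argsort(mu)[-m:].tolist())

def linear_reward(mu, super_arm):
    """Expected reward of a super arm: sum of the means of its base arms."""
    return float(np.sum(np.asarray(mu, dtype=float)[list(super_arm)]))

mu = [0.1, 0.7, 0.4, 0.9]
S = top_m_oracle(mu, m=2)
print(S, linear_reward(mu, S))
```

A linear reward of this kind is monotone in the mean vector, so it is one simple example of a reward function compatible with the monotonicity assumption.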
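As only an $\alpha$-approximation oracle is used for optimization, it is reasonable to use the expected $\alpha$-approximation cumulative regret to measure the performance of the learning agent, defined as follows.
Definition 2.4 (Expected $\alpha$-approximation cumulative regret).
The agent’s policy is evaluated by its expected $\alpha$-approximation cumulative regret,
$$\mathcal{R}(T) = \mathbb{E}\left[\alpha\sum_{t=1}^{T}\max_{S} r_{\boldsymbol{\mu}_t}(S) - \sum_{t=1}^{T} r_{\boldsymbol{\mu}_t}(S_t)\right],$$
where the maximum is over all super arms and the expectation is taken with respect to the selection of $S_t$.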
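To make the regret definition concrete, the sketch below evaluates the quantity inside the expectation for one realized sequence of plays. It assumes a linear reward over sets of `m` base arms, so the best super arm at each step consists of the `m` largest means; `alpha_regret` and the toy numbers are hypothetical.

```python
import numpy as np

def alpha_regret(means, plays, m, alpha=1.0):
    """Cumulative alpha-approximation regret of one realized play sequence.

    means: (T, K) array, means[t] is the mean vector at step t.
    plays: length-T sequence of super arms (iterables of base-arm indices).
    Assumption: linear reward over m-subsets of base arms.
    """
    means = np.asarray(means, dtype=float)
    regret = 0.0
    for mu_t, S_t in zip(means, plays):
        best = np.sort(mu_t)[-m:].sum()    # max over super arms of r_mu(S)
        played = mu_t[list(S_t)].sum()     # r_mu(S_t)
        regret += alpha * best - played
    return float(regret)

# The best pair is {0, 2} for two steps, then {1, 2} after a change-point.
means = [[0.9, 0.1, 0.5]] * 2 + [[0.1, 0.9, 0.5]] * 2
plays = [(0, 2)] * 4            # an agent that never adapts to the change
print(alpha_regret(means, plays, m=2))
```

The agent in this toy run incurs no regret before the change and a constant per-step regret afterwards, because the benchmark is the point-wise optimal super arm rather than a single fixed one.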
2.2 Generalized Likelihood Ratio Change-Point Detector for Sub-Bernoulli Distribution
Sequential change-point detection is a classical problem in statistical sequential analysis, but most existing works make additional assumptions on the pre-change and post-change distributions which might not hold in the bandit setting (siegmund2013sequential; basseville1993detection). In general, designing change-detection algorithms with provable guarantees under few assumptions on the pre-change and post-change distributions is very challenging. In our algorithm design, we will use the generalized likelihood ratio (GLR) change-point detector (besson2019generalized), which works for any sub-Bernoulli distribution. Compared to other existing change detection methods used in piecewise-stationary MAB, the GLR detector has fewer parameters to tune and needs less prior knowledge of the bandit instance. Specifically, GLR only needs the threshold for change-point detection to be tuned, and does not require knowledge of the smallest change in expectation among all change-points. In contrast, CUSUM (liu2018change) and SW (cao2019nearly) both have more parameters to tune and need to know the smallest magnitude among all change-points beforehand, which limits their practicality.
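To define the GLR change-point detector, we need a few more definitions. A distribution $f$ is said to be sub-Bernoulli if $\mathbb{E}_{X\sim f}[e^{\lambda X}] \le e^{\phi_\mu(\lambda)}$ for all $\lambda\in\mathbb{R}$, where $\mu=\mathbb{E}_{X\sim f}[X]$; $\phi_\mu(\lambda)=\log(1-\mu+\mu e^{\lambda})$ is the log moment generating function of a Bernoulli distribution with mean $\mu$. Notice that the support of each reward distribution $f_k^i$ is a subset of the interval $[0,1]$, thus all $f_k^i$ are sub-Bernoulli distributions with mean $\mu_k^i$, due to the following lemma.
Lemma 2.5 (Lemma 1 in cappe2013kullback).
Any distribution $f$ with bounded support within the interval $[0,1]$ and mean $\mu$ is a sub-Bernoulli distribution that satisfies:
$$\mathbb{E}_{X\sim f}[e^{\lambda X}] \le 1-\mu+\mu e^{\lambda} = e^{\phi_\mu(\lambda)}, \quad \forall \lambda\in\mathbb{R}.$$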
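A quick numerical sanity check of the lemma, for one particular distribution, is sketched below. It compares the log moment generating function of the uniform distribution on the unit interval (mean 1/2) with that of a Bernoulli distribution of the same mean on a grid of values of the exponent. The choice of the uniform distribution, the grid, and the function names are assumptions made for illustration; the check is not a proof.

```python
import math

def phi(mu, lam):
    """Log moment generating function of a Bernoulli(mu) distribution."""
    return math.log(1.0 - mu + mu * math.exp(lam))

def log_mgf_uniform01(lam):
    """Log moment generating function of Uniform[0, 1], which has mean 1/2."""
    if lam == 0.0:
        return 0.0
    # E[exp(lam * X)] = (exp(lam) - 1) / lam for X ~ Uniform[0, 1].
    return math.log(math.expm1(lam) / lam)

lams = [x / 10.0 for x in range(-50, 51)]
gaps = [phi(0.5, lam) - log_mgf_uniform01(lam) for lam in lams]
# The sub-Bernoulli property requires every gap to be non-negative.
print(min(gaps) >= 0.0)
```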
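Suppose we have either a time sequence $\{X_t\}_{t=1}^{n}$ drawn from a single sub-Bernoulli distribution for all $t\le n$, or from two sub-Bernoulli distributions with an unknown change-point $\tau\in[1,n-1]$. This problem can be formulated as a parametric sequential test:
$$\mathcal{H}_0: \exists f_0 : X_1,\ldots,X_n \overset{\text{i.i.d.}}{\sim} f_0, \qquad \mathcal{H}_1: \exists f_0\neq f_1,\ \tau\in[1,n-1] : X_1,\ldots,X_\tau \overset{\text{i.i.d.}}{\sim} f_0 \text{ and } X_{\tau+1},\ldots,X_n \overset{\text{i.i.d.}}{\sim} f_1.$$
The GLR statistic for sub-Bernoulli distributions is:
$$\mathrm{GLR}(n) = \sup_{s\in[1,n-1]}\left[s\,\mathrm{kl}(\hat{\mu}_{1:s},\hat{\mu}_{1:n}) + (n-s)\,\mathrm{kl}(\hat{\mu}_{s+1:n},\hat{\mu}_{1:n})\right], \tag{1}$$
where $\hat{\mu}_{s:s'}$ is the mean of the observations collected between $s$ and $s'$, and $\mathrm{kl}(x,y)$ is the binary relative entropy between Bernoulli distributions,
$$\mathrm{kl}(x,y) = x\log\frac{x}{y} + (1-x)\log\frac{1-x}{1-y}.$$
If the GLR in Eq. (1) is large, it indicates that hypothesis $\mathcal{H}_1$ is more likely. Now, we are ready to define the sub-Bernoulli GLR change-point detector with confidence level $\delta\in(0,1)$.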
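The sub-Bernoulli GLR change-point detector with threshold function $\beta(n,\delta)$ is
$$\tau_\delta = \inf\left\{n\in\mathbb{N} : \sup_{s\in[1,n-1]}\left[s\,\mathrm{kl}(\hat{\mu}_{1:s},\hat{\mu}_{1:n}) + (n-s)\,\mathrm{kl}(\hat{\mu}_{s+1:n},\hat{\mu}_{1:n})\right] \ge \beta(n,\delta)\right\},$$
where $\beta(n,\delta) = 2\mathcal{T}\left(\log(3n\sqrt{n}/\delta)/2\right) + 6\log(1+\log n)$, and $\mathcal{T}(x)$ is as in Eq. (13) in kaufmann2018mixture.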
The pseudo-code of sub-Bernoulli GLR change-point detector is summarized in Algorithm 1 for completeness.
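A minimal Python sketch of such a detector is given below; it is not a transcription of Algorithm 1. The threshold `simple_threshold` is a simplified stand-in assumed here for illustration, not the threshold function analyzed in the paper, and the clipping constant `eps` and all function names are likewise hypothetical.

```python
import math

def kl_bern(x, y, eps=1e-12):
    """Binary relative entropy kl(x, y), clipped to avoid log(0)."""
    x = min(max(x, eps), 1.0 - eps)
    y = min(max(y, eps), 1.0 - eps)
    return x * math.log(x / y) + (1.0 - x) * math.log((1.0 - x) / (1.0 - y))

def simple_threshold(n, delta):
    """Stand-in threshold log(3 n^{3/2} / delta) (an assumption, see lead-in)."""
    return math.log(3.0 * n * math.sqrt(n) / delta)

def glr_detect(obs, delta, threshold=simple_threshold):
    """Return True if the GLR statistic of obs reaches the threshold."""
    n = len(obs)
    if n < 2:
        return False
    total = sum(obs)
    mean_all = total / n
    beta = threshold(n, delta)
    prefix = 0.0
    for s in range(1, n):  # split: first s samples vs. last n - s samples
        prefix += obs[s - 1]
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        stat = (s * kl_bern(mean_left, mean_all)
                + (n - s) * kl_bern(mean_right, mean_all))
        if stat >= beta:
            return True
    return False

# A clear change (mean 0 then mean 1) versus a constant sequence.
print(glr_detect([0.0] * 50 + [1.0] * 50, delta=0.05))
print(glr_detect([0.5] * 100, delta=0.05))
```

Each call scans all candidate split points, so the cost is linear in the number of samples; the only tunable quantity is the confidence level passed to the threshold.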
3 The GLR-CUCB Algorithm
Our proposed algorithm, GLR-CUCB, combines an efficient combinatorial semi-bandit algorithm, CUCB (chen2013combinatorial), with a change-point detector running on each base arm (see Algorithm 2). GLR-CUCB requires the number of time steps $T$, the number of base arms $K$, the uniform exploration probability $p$, and the confidence level $\delta$ as inputs. Let $\tau$ denote the last change-point detection time, and $n_k$ denote the number of observations of base arm $k$ after $\tau$; both are initialized to zero at the beginning of the algorithm.
At each time step, GLR-CUCB first determines whether it will enter forced uniform exploration (to ensure each base arm collects sufficient samples for change-point detection) according to the condition in line 3. If it is in forced exploration, a random super arm that contains the base arm scheduled for exploration (line 4) is played, to ensure a sufficient number of samples are collected for every base arm. Otherwise, the next super arm to be played is determined by the $\alpha$-approximation oracle (line 6) given the UCB indices (line 19). Then, the learning agent plays the super arm $S_t$, and gets the reward of the super arm and the rewards of the base arms contained in $S_t$ (line 8). In the next step, the algorithm updates the statistics for each base arm (lines 10-11) in order to run the GLR change-point detector (Algorithm 1) with confidence level $\delta$ (line 12). If the GLR change-point detector detects a change in distribution for any of the base arms, the algorithm sets $\tau$ to be the current time step and all $n_k$'s to be 0 (lines 13-14) before going into time step $t+1$. Lastly, the UCB indices of all base arms are updated (line 19).
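The loop described above can be sketched in Python as follows. This is a simplified illustration rather than Algorithm 2 itself, and several ingredients are assumptions: an exact top-`m` oracle with a linear reward, a forced-exploration schedule that cycles through the base arms once every `floor(K / p)` steps, a UCB bonus of the form `sqrt(3 log(t - tau) / (2 n_k))`, Bernoulli rewards, and a stand-in detection threshold `log(3 n^{3/2} / delta)`.

```python
import math
import numpy as np

def kl(x, y, eps=1e-12):
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def glr_alarm(z, delta):
    """GLR test on samples z with the stand-in threshold (an assumption)."""
    n, total, prefix = len(z), sum(z), 0.0
    if n < 2:
        return False
    beta = math.log(3 * n ** 1.5 / delta)
    for s in range(1, n):
        prefix += z[s - 1]
        stat = (s * kl(prefix / s, total / n)
                + (n - s) * kl((total - prefix) / (n - s), total / n))
        if stat >= beta:
            return True
    return False

def glr_cucb(means, m, p, delta, seed=0):
    """Run the sketch on a (T, K) array of Bernoulli means; return restart times."""
    rng = np.random.default_rng(seed)
    T, K = means.shape
    period = math.floor(K / p)            # forced-exploration period (assumed rule)
    tau = 0                               # last detection time
    hist = [[] for _ in range(K)]         # samples of each base arm since tau
    restarts = []
    for t in range(1, T + 1):
        a = (t - tau) % period
        if 1 <= a <= K:                   # forced exploration of base arm a - 1
            others = [k for k in range(K) if k != a - 1]
            S = [a - 1] + rng.choice(others, size=m - 1, replace=False).tolist()
        else:                             # exact top-m oracle on the UCB indices
            ucb = [np.mean(h) + math.sqrt(3 * math.log(t - tau) / (2 * len(h)))
                   if h else float("inf") for h in hist]
            S = np.argsort(ucb)[-m:].tolist()
        alarm = False
        for k in S:                       # semi-bandit feedback for played arms
            hist[k].append(float(rng.random() < means[t - 1, k]))
            alarm = alarm or glr_alarm(hist[k], delta)
        if alarm:                         # global restart of all statistics
            tau, hist = t, [[] for _ in range(K)]
            restarts.append(t)
    return restarts

# Hypothetical instance: the two best arms swap with the two worst at step 300.
T, change = 600, 300
means = np.vstack([np.tile([0.9, 0.8, 0.1, 0.1], (change, 1)),
                   np.tile([0.1, 0.1, 0.9, 0.8], (T - change, 1))])
restarts = glr_cucb(means, m=2, p=0.1, delta=0.05)
print(restarts)
```

The sketch uses a global restart: one alarm on any base arm discards the statistics of all base arms. Rerunning the detector on the full history at every step keeps the code short but costs time linear in the number of samples per call.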
Uniform exploration is necessary for this algorithm, and a similar strategy has been adopted in liu2018change; cao2019nearly. Intuitively, uniform exploration ensures each base arm gathers sufficient samples to guarantee quick change detection, whereas pure UCB exploration is incapable of this. A more rigorous argument is given in garivier2011upper, which shows that pure UCB exploration provably performs badly on piecewise-stationary MAB.
Thompson sampling (TS) often performs better than UCB policies in empirical simulations, but it has been shown that one cannot incorporate an approximate oracle in TS even for MAB problems (wang2018thompson). Thus our algorithm adopts a UCB policy for the bandit component to ensure compatibility with the approximation oracle.
4 Regret Upper Bound
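In this section, we analyze the $T$-step regret of our proposed algorithm GLR-CUCB. Recall $T$ is the time horizon, $N$ is the number of piecewise-stationary segments, $\nu_1,\ldots,\nu_{N-1}$ are the change-points, and for each segment $i$, $\boldsymbol{\mu}^i$ is the vector encoding the expected rewards of all base arms. A super arm $S$ is bad with respect to the $i$th piecewise-stationary distributions if $r_{\boldsymbol{\mu}^i}(S) < \alpha\max_{S'} r_{\boldsymbol{\mu}^i}(S')$. We define $\mathcal{S}_B^i$ to be the set of bad super arms with respect to the $i$th piecewise-stationary segment. We define the suboptimality gaps in the $i$th stationary segment as follows:
$$\Delta^i_{\min} = \alpha\max_{S'} r_{\boldsymbol{\mu}^i}(S') - \max\{r_{\boldsymbol{\mu}^i}(S) : S\in\mathcal{S}_B^i\}, \qquad \Delta^i_{\max} = \alpha\max_{S'} r_{\boldsymbol{\mu}^i}(S') - \min\{r_{\boldsymbol{\mu}^i}(S) : S\in\mathcal{S}_B^i\}.$$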
Furthermore, let and be the maximum and minimum sub-optimal gaps for the whole time horizon, respectively. Lastly, denote the largest gap at change-point as , , and . We need the following assumption for our theoretical analysis.
Assumption 4.1.
Define and assume , , where .
Tuning and properly (See Corollary 4.3), and applying the upper bound on by kaufmann2018mixture with ,
the length of each piecewise-stationary segment is . Roughly, we assume the length of each stationary segment to be sufficiently long, in order to let the GLR change-point detector detect the change in distribution within a reasonable delay with high probability. Similar assumption on the length of stationary segments also appears in other literature on piecewise stationary MAB (liu2018change; cao2019nearly; besson2019generalized). Note that Assumption 4.1 is only required for the theoretical analysis; Algorithm 2 can be implemented regardless of this assumption. Now we are ready to state the regret upper bound for Algorithm 2.
Theorem 4.2 indicates that the regret comes from four sources. Terms (a) and (b) correspond to the cost of exploration, while terms (c) and (d) correspond to the cost of change-point detection. More specifically, term (a) is due to UCB exploration, term (b) is due to uniform exploration, term (c) is due to the expected delay of the GLR change-point detector, and term (d) is due to the false alarm probability of the GLR change-point detector. We need to carefully tune the exploration probability $p$ and the confidence level $\delta$, which controls the false alarm probability, to balance the trade-off.
The following corollary comes directly from Theorem 4.2 by properly tuning the parameters in the algorithm.
Corollary 4.3.
Let , we have
( is known) Choosing , , gives ;
( is unknown) Choosing , , gives .
The effect of oracle is to reduce the dependency on number of base arms from to during exploration (first term in the regret appeared in Corollary 4.3). In the worst case, can be exponential with respect to . Recall that if we use standard MAB algorithms for exploration, the dependency on is .
As becomes larger, the regret is dominated by the cost of change-point detection, which has similar order compared to the regret bound of piecewise-stationary MAB algorithms. This is reasonable since our setting assumes that we have access to the reward of the base arms contained in the super arm played by the agent.
When is large, the order of the regret bound is similar to the regret bound of adversarial bandit. But note that regret definition for adversarial bandit and piecewise-stationary bandit is different. For the first case, the regret is evaluated with respect to one fixed arm which is optimal for the whole horizon. But for the second case, the regret is evaluated with respect to point-wise optimal arm, which is much more challenging.
We can use Corollary 4.3 as a guide for parameter tuning. The above corollary indicates that without knowledge of the number of change-points , we pay a penalty of a factor of in the long run.
Lemma 4.4.
Under the stationary scenario, i.e. $N=1$, the $\alpha$-approximation cumulative regret of GLR-CUCB is upper bounded as:
The first term is due to the possible false alarm of the change-point detection subroutine, the second term is due to the uniform exploration, and the last term is due to the UCB exploration. We upper bound the false alarm probability in Lemma 4.5, as follows.
Lemma 4.5 (False alarm probability in the stationary scenario).
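Consider the stationary scenario, i.e. $N=1$, with confidence level $\delta\in(0,1)$; we have that
$$\mathbb{P}(\tau\le T) \le K\delta,$$
where $\tau$ is the first time a change-point is declared on any of the base arms.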
By setting , we will have . Asymptotically, the false alarm probability will go to 0.
In the next lemma, we show the GLR change-point detector is able to detect change in distribution reasonably well with high probability, given all previous change-points were detected reasonably well. The formal statement is as follows.
Lemma 4.6 (Lemma 12 in besson2019generalized).
Define the event that all the change-points up to the $i$th one have been detected successfully within a small delay:
Then, , and , where is the detection time of the th change-point.
Lemma 4.6 provides an upper bound for the conditional expected detection delay, given the good events .
Corollary 4.7 (Bounded conditional expected delay).
Given these lemmas, we can derive the regret upper bound for GLR-CUCB in a recursive manner. Specifically, we prove Theorem 4.2 by recursively decomposing the regret into a collection of good events and bad events. The good events contain all sample paths on which GLR-CUCB reinitializes the UCB indices of the base arms after every change-point, correctly and within a small delay. On the other hand, the bad events contain all sample paths on which the GLR change detector either fails to detect a change in distribution or detects it with a large delay. The cost incurred given the good events can be upper bounded by Lemma 4.4 and Corollary 4.7. The cost incurred given the bad events is controlled by upper bounding the probabilities of the bad events via Lemma 4.5 and Lemma 4.6. Detailed proofs are presented in Appendix A.
5 Regret Lower Bound
The lower bound for MAB problems has been studied extensively. Previously, the best available minimax lower bound for piecewise-stationary MAB was $\Omega(\sqrt{T})$ by garivier2011upper. Note that piecewise-stationary MAB is a special instance of piecewise-stationary CMAB in which every super arm is a base arm, thus this lower bound still holds for piecewise-stationary CMAB. We derive a tighter lower bound by characterizing the dependency on $N$ and $K$.
If and , then the worst-case regret for any policy is lower bounded by
where , .
(Sketch) The high level idea is to construct a randomized ‘hard’ instance which is appropriate to our setting (bubeck2012regret; besbes2014stochastic; lattimore2018bandit), then analyze its regret lower bound which holds for any exploration policy. The construction of this ‘hard’ instance is as follows.
We partition the time horizon into segments with equal length except for the last segment. In each segment, assume the rewards of all arms are Bernoulli distributions and stay unchanged. At each time step there is an optimal arm with expected reward of and the remaining arms have the same expected reward of . The optimal arm will change in two consecutive segments by sampling uniformly at random from the remaining arms.
We then use Lemma A.1 in auer2002nonstochastic to upper bound the expected number of pulls to any arm being optimal under change of distributions, from the sub-optimal reward distribution to the optimal reward distribution (Bern( to Bern()). Given the upper bound of expected number of pulls to the optimal arm, we can lower bound the regret for any exploration policy. By properly tuning and after some additional steps, we can derive the minimax regret lower bound. The condition comes from the fact that the lower bound needs to be non-trivial, and comes from the tuning of .
For the detailed proof, please refer to Appendix B. ∎
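The construction in the sketch can be illustrated with a short Python generator. The base mean of 1/2, the gap parameter `eps`, the rule for the first segment, and the function name are assumptions made for illustration; they follow the usual pattern of such hard instances and are not taken from the proof in Appendix B.

```python
import numpy as np

def hard_instance(T, N, K, eps, seed=0):
    """Return a (T, K) array of Bernoulli means and the optimal arm per segment.

    The horizon is split into N segments of equal length, except that the
    last segment absorbs the remainder. In each segment one arm has mean
    0.5 + eps and all others have mean 0.5 (assumed values). The optimal
    arm of a new segment is drawn uniformly from the other K - 1 arms.
    Requires K >= 2 and N <= T.
    """
    rng = np.random.default_rng(seed)
    seg_len = T // N
    means = np.full((T, K), 0.5)
    optimal = []
    for i in range(N):
        start = i * seg_len
        end = T if i == N - 1 else (i + 1) * seg_len
        if i == 0:
            best = int(rng.integers(K))
        else:
            # Resample uniformly among arms other than the current optimum.
            candidates = [k for k in range(K) if k != optimal[-1]]
            best = int(rng.choice(candidates))
        optimal.append(best)
        means[start:end, best] = 0.5 + eps
    return means, optimal

means, optimal = hard_instance(T=1000, N=4, K=5, eps=0.1)
print(optimal)
```

Because the optimal arm always moves to a different arm, every segment boundary is a genuine change-point, and a policy cannot reuse what it learned about the previous optimum.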
The conditions for this minimax lower bound are mild, since in practice the number of base arms is usually much larger and we care about the long-term regret, in other words, the large-$T$ regime.
Our minimax lower bound shows that GLR-CUCB is nearly order-optimal with respect to all parameters. On the other hand, as a byproduct, this bound also indicates that EXP3S (auer2002nonstochastic) and MUCB (cao2019nearly) are nearly order-optimal for piecewise-stationary MAB, up to poly-logarithm factors. To be more specific, EXP3S and MUCB achieve regret and respectively.
6 Experiments
We compare GLR-CUCB with five baselines from the literature, one variant of GLR-CUCB, and one oracle algorithm. Specifically, DUCB (kocsis2006discounted) and MUCB (cao2019nearly) are selected from the piecewise-stationary MAB literature; CUCB (chen2013combinatorial), CTS (wang2018thompson), and Hybrid (zimmert2019beating) are selected from the stochastic combinatorial semi-bandit literature. The variant of GLR-CUCB, termed LR-GLR-CUCB, uses a different restart strategy. Instead of restarting the estimation of all base arms once a change-point is detected, LR-GLR-CUCB uses a local restart strategy (it only restarts the estimation of the base arms that are detected to have changes in reward distributions). For the oracle algorithm, termed Oracle-CUCB, we assume the algorithm knows when the optimal super arm changes and restarts CUCB at these change-points. Note that this is stronger than knowing the change-points, since a change in distribution does not imply a change in the optimal super arm. Experiments are conducted on both synthetic and real-world datasets for the $m$-set bandit problem, which aims to identify the $m$ arms with the highest expected reward at each time step. Equivalently, the reward function is the summation of the expected rewards of $m$ base arms. Since DUCB and MUCB are originally designed for piecewise-stationary MAB, to adapt them to the piecewise-stationary CMAB setting, we treat every super arm as a single arm when we run these two algorithms. Plots of the reward distributions of base arms over time are postponed to Appendix C.1. The details of parameter tuning for all of these algorithms in the different experiments are included in Appendix C.2.
6.1 Synthetic Dataset
In this case we design a synthetic piecewise-stationary combinatorial semi-bandit instance as follows:
Each base arm follows Bernoulli distribution.
Only one base arm changes its distribution between two consecutive piecewise-stationary segments.
Every piecewise-stationary segment is of equal length.
We let , , , and . The average regret of all algorithms are summarized in Figure 1.
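An instance with the three properties listed above can be generated as in the following Python sketch. The horizon, number of segments, number of base arms, the range of the means, and the function names are hypothetical choices for illustration and are not the values used in the experiment.

```python
import numpy as np

def synthetic_means(T, N, K, seed=0):
    """Return a (T, K) array of Bernoulli means with N equal-length segments.

    Exactly one base arm gets a new mean between consecutive segments.
    Assumes N divides T; means are drawn from [0.1, 0.9] (assumed range).
    """
    rng = np.random.default_rng(seed)
    seg_len = T // N
    mu = rng.uniform(0.1, 0.9, size=K)
    blocks = []
    for i in range(N):
        if i > 0:
            mu = mu.copy()
            # Redraw the mean of a single, randomly chosen base arm.
            mu[rng.integers(K)] = rng.uniform(0.1, 0.9)
        blocks.append(np.tile(mu, (seg_len, 1)))
    return np.vstack(blocks)

def sample_rewards(means, seed=1):
    """Draw one Bernoulli reward per base arm and time step."""
    rng = np.random.default_rng(seed)
    return (rng.random(means.shape) < means).astype(float)

means = synthetic_means(T=500, N=5, K=6)
rewards = sample_rewards(means)
# Number of base arms whose mean changes between consecutive steps.
changed = (means[1:] != means[:-1]).sum(axis=1)
print(np.flatnonzero(changed) + 1)
```

In a semi-bandit simulation only the entries of `rewards` that belong to the played super arm would be revealed to the agent at each step.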
Note that the optimal super arm does not change for the last three piecewise-stationary segments. Observe that the well-tuned GLR change-point detector is insensitive to change with small magnitude, which implicitly avoids unnecessary and costly global restart, since small change is less likely to affect the optimal super arm. Surprisingly, GLR-CUCB and LR-GLR-CUCB perform nearly as well as Oracle-CUCB and significantly better than other algorithms in regrets. In general, algorithms designed for stochastic CMAB outperform algorithms designed for piecewise-stationary MAB. The reason is when the horizon is small, the dimension of the action space dominates the regret, and this effect becomes more obvious when is larger. Although order-wise, the cost incurred by the change-point detection is much higher than the cost incurred by exploration.
Note that our experiment on this synthetic dataset does not satisfy Assumption 4.1. For example, the gap between the first segment and second segment is , and we choose and for GLR-CUCB, which means the length of the second segment should be at least 9874. However, the actual length of the second segment is only 1000. Thus our algorithm performs very well compared to other algorithms even if Assumption 4.1 is violated. If Assumption 4.1 is satisfied, GLR-CUCB can only perform better since it is easier to detect the change in distribution.
6.2 Yahoo! Dataset
We adopt the benchmark dataset for the real-world evaluation of bandit algorithms from Yahoo! (Yahoo! Front Page Today Module User Click Log Dataset, available at https://webscope.sandbox.yahoo.com). This dataset contains the user click log for news articles displayed in the Featured Tab of the Today Module (li2011unbiased). Every base arm corresponds to the click rate of one article. Upon arrival of a user, our goal is to maximize the expected number of clicked articles by presenting $m$ out of $K$ articles to the user.
Yahoo! Experiment 1 (, , ).
We pre-process the dataset following cao2019nearly. To make the experiment nontrivial, we modify the dataset by: 1) the click rate of each base arm is enlarged by times; 2) Reducing the time horizon to . Results are in Figure 2.
Yahoo! experiment 1 is much harder than the synthetic problem, since it is much more non-stationary. Our experiments show GLR-CUCB still significantly outperforms other algorithms and only has a small gap with respect to Oracle-CUCB. Again, Assumption 4.1 does not hold for these two instances, thus we believe it is fair to compare GLR-CUCB with other algorithms. Unexpectedly, LR-GLR-CUCB performs even better than Oracle-CUCB, which suggests there is still much to exploit in piecewise-stationary bandits, since global restart has inferior performance in some cases, especially when the change in distribution is not significant. Additional experiments on the Yahoo! dataset can be found in Appendix C.1.
7 Conclusion and Future Work
We have developed the first efficient and general algorithm for piecewise-stationary CMAB, termed GLR-CUCB, which extends CUCB (chen2013combinatorial) by incorporating a GLR change-point detector. We show that the regret of GLR-CUCB is upper bounded on the order of $\mathcal{O}(\sqrt{NKT\log T})$, and prove a minimax lower bound for piecewise-stationary MAB and CMAB on the order of $\Omega(\sqrt{NKT})$, which shows our algorithm is nearly order-optimal within poly-logarithmic factors. Experimental results show our proposed algorithm outperforms other state-of-the-art algorithms.
Future work includes designing algorithms for piecewise-stationary CMAB with a better restart strategy. GLR-CUCB restarts whenever the GLR change-point detector declares that the reward distribution of one base arm has changed, but this restart is very likely unnecessary, because a change-point with small magnitude might not change the optimal super arm. Another very challenging unsolved problem is whether one can close the gap between the regret upper bound and the minimax regret lower bound, that is, whether one can develop an algorithm that is order-optimal for piecewise-stationary CMAB.
Appendix A Detailed Proofs of Theorem 4.2
A.1 Proof of Lemma 4.4
Define as a counter for each base arm at time (note that it is different from the counter defined in the algorithm) and update the counters in each round as follows: (1) After the initialization rounds, set . (2) For a round , if is bad, then increase by one, where .
By definition, the total number of bad rounds at time is no more than . Throughout the proof, we will use to denote the total number of times arm is played by the agent till time , to emphasize the time dependency in order to make the proof more readable.
Note that when a bad super arm is played, it incurs loss at most . Thus,
Thus it suffices to upper bound to upper bound the cumulative regret. Let , we have
The next step is to show . Let be the number of times arm is played in the first rounds, be the empirical mean of samples of th arm at the first piecewise-stationary segment, be the UCB index of th arm at time , and be the actual mean of th arm during the first piecewise-stationary segment. For any , by applying the Hoeffding’s inequality, we have
Define the event . By the union bound we have . However, we can show that . In other words, these two events are mutually exclusive. The reason is if both of these events hold, we have
where is the mean vector of the base arms at the first piecewise-stationary segment. As for these above inequalities, (a) holds since ; (b) holds by the -Lipschitz property of and . Note that the upper bound of the norm of vector difference comes from the fact that , and ; (c) holds by the definition of -approximation oracle; (d) holds by the monotone property of and . However this contradicts the fact that , since , which implies that these two events are mutually exclusive. Thus,
To sum up,
The proof is done. ∎
A.2 Proof of Lemma 4.5
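Define $\tau_k$ as the first change-point detection time of the $k$th base arm, and then $\tau=\min_{k}\tau_k$, as GLR-CUCB restarts the whole algorithm if a change-point is detected on any of the base arms. Applying the union bound to the false alarm probability $\mathbb{P}(\tau\le T)$, we have that
$$\mathbb{P}(\tau\le T) \le \sum_{k=1}^{K}\mathbb{P}(\tau_k\le T).$$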
Recall the GLR statistic defined in Eq. (1), and substitute it into ,
where is the mean of the rewards generated from the distribution with expected reward from time step to as only stationary scenario is considered here. Here, inequality (a) is because of the fact that ; inequality (b) holds since:
inequality (c) is because of the union bound; inequality (d) is according to the Lemma 10 in besson2019generalized; inequality (e) holds due to the Riemann zeta function , when , . Summing over , we conclude that . ∎
A.3 Proof of Corollary 4.7
By the definition of , it follows directly that the conditional expected detection delay is upper bounded by . ∎
A.4 Proof of Theorem 4.2
Define the good event and good event , . Recall the definition of the good event that all the change-points up to th one have been detected successfully and efficiently in Eq. (2), and we can find that is the intersection of the event sequence of and up to the th change-point. By first decomposing the expected -approximation cumulative regret with respect to the event , we have that