Hedging the Drift: Learning to Optimize under Non-Stationarity
Wang Chi Cheung
Department of Industrial Systems Engineering and Management, National University of Singapore isecwc@nus.edu.sg
David Simchi-Levi
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu
Ruihao Zhu
Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, MA 02139, rzhu@mit.edu
We introduce general data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for non-stationary bandit settings. These settings capture applications such as advertisement allocation and dynamic pricing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Our main contribution is a general algorithmic recipe that first converts the rate-optimal Upper-Confidence-Bound (UCB) algorithm for stationary bandit settings into a tuned Sliding Window-UCB algorithm with optimal dynamic regret for the corresponding non-stationary counterpart. Boosted by the novel bandit-over-bandit framework with automatic adaptation to the unknown changing environment, it even permits us to enjoy, in a (surprisingly) parameter-free manner, this optimal dynamic regret if the amount of non-stationarity is moderate to large, or an improved (compared to existing literature) dynamic regret otherwise. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the “forgetting principle” in their online learning processes, which is vital in changing environments. We further conduct extensive numerical experiments on both synthetic data and the CPRM-12-001: On-Line Auto Lending dataset provided by the Center for Pricing and Revenue Management at Columbia University to show that our proposed algorithms achieve superior dynamic regret performances.
Key words: data-driven decision-making, non-stationary bandit optimization, parameter-free algorithm
Consider the following general decision-making framework: a decision-maker (DM) interacts with an environment by picking actions one at a time sequentially. Upon selecting an action, it instantly receives a reward drawn randomly from a probability distribution tied to this action. The goal of the DM is to maximize its cumulative rewards. However, it also faces the following challenges:

Uncertainty: the reward distribution of each action is initially unknown to the DM, and it thus has to estimate the underlying distributions via interacting with the environment.

Non-Stationarity: moreover, the environment is non-stationary, and the reward distributions can evolve over time.

Partial/Bandit Feedback: finally, the DM can only observe the random reward of the selected action each time; while the rewards of the unchosen actions remain unknown.
Evidently, the DM faces a trilemma among exploration, exploitation, and adaptation to changes. On one hand, the DM wishes to exploit, i.e., to select the action with the best historical performance to earn as much reward as possible; on the other hand, it also wants to explore other actions to get a more accurate estimate of the reward distributions. The changing environment makes the exploration-exploitation trade-off even more delicate. Indeed, past observations could become obsolete due to the changes in the environment, and the DM needs to explore for changes and refrain from exploiting possibly outdated observations. It turns out that many applications fall naturally into this general framework:

An online platform allocates advertisements (ads) to a sequence of users. For each arriving user, the platform has to deliver an ad to her, and it only observes the user’s response to the displayed ad. The platform has full access to the features of the ads and the users. Following (Agrawal and Goyal 2013), we assume that a user’s click behavior towards an ad, or simply the click-through rate (CTR) of this ad by a particular user, follows a probability distribution governed by a common, but initially unknown, response function that is linear in the features of the ad and the user. The platform’s goal is to maximize the total profit. However, the unknown response function can change over time. For instance, around the time that Apple releases a new iPhone, one can expect the popularity of an Apple ad to grow.

A seller decides the (personalized) price dynamically (Keskin and Zeevi 2014, 2016, Besbes and Zeevi 2015, Ban and Keskin 2018) for each of the incoming customers, with the hope of maximizing sales profit. Beginning with an unknown demand function, the DM only observes the purchase decision of a customer under the posted price, but not under any other price. In addition, the customers’ reactions towards the same price can vary across time due to product reviews, the emergence of competing products, etc.
Our framework is closely related to the multi-armed bandit (MAB) problems. MAB problems are online problems with partial feedback, in which the DM is subject to uncertainty in his/her learning environment. Traditionally, most MAB problems are studied in the stochastic (Auer et al. 2002b) and adversarial (Auer et al. 2002a) environments. In the former, the model uncertainty is static and the partial feedback is corrupted by a mean zero random noise. The DM aims at estimating the latent static environment using historical data, and converging to a static optimal decision. In the latter, the model is dynamically changed by an adversary. The DM strives to hedge against the changes, and compete favorably in comparison to certain benchmark policies.
Unfortunately, strategies for the stochastic environments can quickly deteriorate under non-stationarity as historical data might “expire”, while the assumption of a confronting adversary in the adversarial settings could be too pessimistic. Recently, a stream of research works (see Related Works) focuses on MAB problems in a drifting environment, which is a hybrid of a stochastic and an adversarial environment. Although the environment can be dynamically and adversarially changed, the total change (quantified by a suitable metric) in a $T$-step problem is upper bounded by $B_T$, the variation budget. The feedback is corrupted by a mean zero random noise. The aim is to minimize the dynamic regret, which is the optimality gap compared to the sequence of (possibly dynamically changing) optimal decisions, by simultaneously estimating the current environment and hedging against future changes in every time step. Most of the existing works for non-stationary bandits have focused on the somewhat ideal case in which $B_T$ is known. In practice, however, $B_T$ is often not available ahead of time. Though some efforts have been made in this direction (Karnin and Anava 2016, Luo et al. 2018), how to design algorithms with low dynamic regret when $B_T$ is unknown remains largely a challenging problem.
In this paper, we design and analyze novel algorithms for bandit problems in a drifting environment. Our main contributions are listed as follows.

When the variation budget $B_T$ is known, we characterize the lower bound of dynamic regret, and develop a tuned Sliding Window Upper-Confidence-Bound (SW-UCB) algorithm whose dynamic regret upper bound matches it up to logarithmic factors.

When $B_T$ is unknown, we propose a novel Bandit-over-Bandit (BOB) framework that tunes SW-UCB adaptively. When the amount of non-stationarity is moderate to large, the application of BOB on the SW-UCB algorithm recovers the optimal dynamic regret; otherwise, it obtains a dynamic regret bound with the best dependence on $T$ compared to existing literature.

Our results are general in the sense that given any Upper-Confidence-Bound (UCB) algorithm with optimal regret for the stationary bandit settings, we are able to convert it into the corresponding SW-UCB algorithm with optimal dynamic regret (when $B_T$ is known) and the BOB algorithm with nearly optimal dynamic regret (when $B_T$ is unknown), respectively.

Our algorithm design and analysis shed light on the fine balance between exploration, exploitation and adaptation to changes in dynamic learning environments. We rigorously incorporate the “forgetting principle” into the Optimism-in-Face-of-Uncertainty principle, by demonstrating that the DM should dispose of observations which are sufficiently old. We provide precise criteria for the disposal, and rigorously show the convergence to optimality under these criteria.

Finally, we point out that a preliminary version of this paper will appear in the International Conference on Artificial Intelligence and Statistics (AISTATS 2019) (Cheung et al. 2019), and the current paper is a significantly extended version of it. Specifically, the current version provides a substantially refined analysis and thus a greatly improved regret bound for the BOB algorithm when compared to (Cheung et al. 2019). More importantly, we demonstrate the generality of the established results in the current paper, as exemplified by the extensions to the prevalent $d$-armed bandit and generalized linear bandit settings.
The rest of the paper is organized as follows. We first review existing works in stationary and non-stationary environments, and then formulate the non-stationary linear bandit model. We note that the choice of the linear model is purely for the purpose of illustration, and all the results derived are applicable to other settings (please see Section 4). We next establish a minimax lower bound for our problem, describe the sliding window estimator for parameter estimation under non-stationarity, develop the sliding window-upper confidence bound algorithm with optimal dynamic regret (when the amount of non-stationarity is known ahead), and introduce the novel bandit-over-bandit framework with nearly optimal regret. In Section 4, we demonstrate the generality of the established results by applying them to other popular bandit settings, such as the multi-armed bandit and generalized linear bandit settings. In Section 10, we conduct extensive numerical experiments with both synthetic and CPRM-12-001: On-Line Auto Lending datasets to show the superior empirical performances of our algorithms. Finally, we conclude our paper.
MAB problems with stochastic and adversarial environments are extensively studied, as surveyed in (Bubeck and Cesa-Bianchi 2012, Lattimore and Szepesvári 2018). To model inter-dependence relationships among different arms, models for linear bandits in stochastic environments have been studied. In (Auer 2002, Dani et al. 2008, Rusmevichientong and Tsitsiklis 2010, Chu et al. 2011, Abbasi-Yadkori et al. 2011), UCB type algorithms for stochastic linear bandits were studied, and Abbasi-Yadkori et al. (2011) provided the state-of-the-art algorithm for the problem. Thompson sampling (TS) algorithms proposed in (Russo and Roy 2014, Agrawal and Goyal 2013, Abeille and Lazaric 2017) are able to bypass the high computational complexities provided that one can efficiently sample from the posterior on the parameters and optimize the reward function accordingly. Unfortunately, achieving an optimal regret bound via TS algorithms is possible only if the true prior over the reward vector is known.
(Besbes et al. 2014, 2018) considered the multi-armed bandit in a drifting environment. They achieved the tight dynamic regret bound when $B_T$ is known. Wei et al. (2016) provided refined regret bounds based on empirical variance estimation, assuming the knowledge of $B_T$. Subsequently, Karnin and Anava (2016) considered the setting without knowing $B_T$, and achieved a dynamic regret bound for that setting. In a recent work, (Luo et al. 2018) considered multi-armed contextual bandits in drifting environments, and in particular demonstrated an improved bound for the multi-armed bandit problem in drifting environments when $B_T$ is not known, among other results. (Keskin and Zeevi 2016) considered a dynamic pricing problem in a drifting environment with linear demands. Assuming a known variation budget $B_T$, they proved a dynamic regret lower bound and proposed a matching algorithm. When $B_T$ is not known, they designed an algorithm with a dynamic regret guarantee. In (Besbes et al. 2015), a general problem of stochastic optimization under the known budgeted variation environment was studied. The authors presented various upper and lower bounds in the full feedback settings. Finally, various online problems with full information feedback and drifting environments are studied in the literature (Chiang et al. 2012, Jadbabaie et al. 2015).
Apart from the drifting environment, numerous research works consider the switching environment, where the time horizon is partitioned into a bounded number of intervals, and the environment switches from one stochastic environment to another across different intervals. The partition is not known to the DM. Algorithms are designed for various bandits, assuming a known number of intervals (Auer et al. 2002a, Garivier and Moulines 2011, Luo et al. 2018), or assuming an unknown number of intervals (Karnin and Anava 2016, Luo et al. 2018). Notably, the Sliding Window-UCB algorithm and the “forgetting principle” were first proposed by Garivier and Moulines (2011), but they were only analyzed under multi-armed switching environments.
Finally, it is worth pointing out that our Bandit-over-Bandit framework has connections with algorithms for online model selection and bandit corralling, see e.g., (Agarwal et al. 2017, Besbes et al. 2018) and references therein. This and similar techniques have been investigated in the context of non-stationary bandits in (Luo et al. 2018, Besbes et al. 2018). Notwithstanding, existing works either obtain sub-optimal dynamic regret bounds or only empirical performance guarantees.
In this section, we introduce the notations that will be used throughout the discussions and the model formulation.
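Throughout the paper, all vectors are column vectors, unless specified otherwise. We define $[n]$ to be the set $\{1, 2, \ldots, n\}$ for any positive integer $n$. The notation $s:t$ is the abbreviation of consecutive indexes $s, s+1, \ldots, t$. We use $\|x\|$ to denote the Euclidean norm of a vector $x \in \mathbb{R}^d$. For a positive definite matrix $A \in \mathbb{R}^{d \times d}$, we use $\|x\|_A = \sqrt{x^\top A x}$ to denote the matrix norm of a vector $x \in \mathbb{R}^d$. We also denote $x \vee y$ and $x \wedge y$ as the maximum and minimum between $x, y \in \mathbb{R}$, respectively. When logarithmic factors are omitted, we use $\tilde{O}(\cdot)$ to denote function growth.
In each round $t \in [T]$, a decision set $D_t \subseteq \mathbb{R}^d$ is presented to the DM, and it has to choose an action $X_t \in D_t$. Afterwards, the reward
$$Y_t = \langle X_t, \theta_t \rangle + \eta_t$$
is revealed. Here, we allow $D_t$ to be chosen by an oblivious adversary whose actions are independent of those of the DM, and can be determined before the protocol starts (Cesa-Bianchi and Lugosi 2006). The vector of parameters $\theta_t$ is an unknown $d$-dimensional vector, and $\eta_t$ is a random noise drawn i.i.d. from an unknown sub-Gaussian distribution with variance proxy $R$. This implies $\mathbb{E}[\eta_t] = 0$, and for all $\lambda \in \mathbb{R}$ we have
$$\mathbb{E}\left[\exp(\lambda \eta_t)\right] \le \exp\left(\frac{\lambda^2 R^2}{2}\right).$$
Following the convention of existing bandits literature (Abbasi-Yadkori et al. 2011, Agrawal and Goyal 2013), we assume there exist positive constants $L$ and $S$ such that $\|X\| \le L$ and $\|\theta_t\| \le S$ hold for all $X \in D_t$ and all $t \in [T]$, and the problem instance is normalized so that $|\langle X, \theta_t \rangle| \le 1$ for all $X \in D_t$ and $t \in [T]$.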
Instead of assuming the stochastic environment, where the reward function remains stationary across the time horizon, we allow it to change over time. Specifically, we consider the general drifting environment: the sum of Euclidean differences of consecutive $\theta_t$’s should be bounded by some variation budget $B_T$, i.e.,
$$\sum_{t=1}^{T-1} \|\theta_{t+1} - \theta_t\| \le B_T. \qquad (1)$$
We again allow the $\theta_t$’s to be chosen adversarially by an oblivious adversary. We also denote the set of all possible obliviously selected sequences of $\theta_t$’s that satisfy inequality (1) as $\Theta(B_T)$.
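The DM’s goal is to design a policy $\pi$ to maximize the cumulative reward, or equivalently to minimize the worst case cumulative regret against the optimal policy $\pi^*$, which has full knowledge of the $\theta_t$’s. Denoting $x_t^* = \arg\max_{x \in D_t} \langle x, \theta_t \rangle$, the dynamic regret of a given policy $\pi$ is defined as
$$\mathcal{R}_T(\pi) = \sup_{\theta_{1:T} \in \Theta(B_T)} \mathbb{E}\left[\sum_{t=1}^{T} \langle x_t^* - X_t, \theta_t \rangle\right],$$
where the expectation is taken with respect to the (possible) randomness of the policy.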
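As an illustration of the variation budget and the dynamic regret, the following Python sketch evaluates both on a toy instance. The instance (two fixed unit actions and a linearly drifting parameter) and the function names `total_variation` and `dynamic_regret` are assumptions made for illustration only. The sketch evaluates one fixed parameter sequence and one fixed action sequence, rather than the worst case over all admissible sequences or the expectation over a randomized policy.

```python
import math


def total_variation(thetas):
    """Sum of Euclidean distances between consecutive parameter vectors."""
    return sum(math.dist(a, b) for a, b in zip(thetas, thetas[1:]))


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def dynamic_regret(decision_sets, thetas, chosen):
    """Sum over rounds of (best expected reward - expected reward of chosen action)."""
    regret = 0.0
    for D_t, theta_t, x_t in zip(decision_sets, thetas, chosen):
        best = max(dot(x, theta_t) for x in D_t)
        regret += best - dot(x_t, theta_t)
    return regret


# Toy instance (hypothetical): the parameter drifts linearly from (1, 0) to (0, 1).
T = 5
actions = [(1.0, 0.0), (0.0, 1.0)]
thetas = [(1.0 - t / (T - 1), t / (T - 1)) for t in range(T)]
decision_sets = [actions] * T
always_first = [actions[0]] * T  # a policy that never adapts to the drift

variation = total_variation(thetas)
regret = dynamic_regret(decision_sets, thetas, always_first)
print(variation, regret)
```

On this instance, a policy that always plays the first action accumulates regret only after the drift makes the second action better, which is the effect the dynamic regret benchmark is designed to capture.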
We first provide a lower bound on the regret to characterize the best achievable regret.
Theorem 1
For any $T \ge d$, the dynamic regret of any policy $\pi$ satisfies $\mathcal{R}_T(\pi) = \Omega\left(d^{2/3} B_T^{1/3} T^{2/3}\right)$.
Proof Sketch. The construction of the lower bound instance is similar to the approach of (Besbes et al. 2014): nature divides the whole time horizon into blocks of equal length $H$ rounds (the last block can possibly have fewer than $H$ rounds). In each block, nature initiates a new stationary linear bandit instance. Nature also chooses the parameter for a block in a way that depends only on the DM’s policy, and the worst case regret within a block is $\Omega(d\sqrt{H})$. Since there are at least $\lfloor T/H \rfloor$ blocks, the total regret is $\Omega(dT/\sqrt{H})$. By examining the variation budget constraint, we have that the smallest possible $H$ one can take is $\lceil (dT)^{2/3} B_T^{-2/3} \rceil$. The statement then follows. Please refer to the appendix for the complete proof. Q.E.D.
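As a preliminary, we introduce the sliding window regularized least squares estimator, which is the key tool in estimating the unknown parameters $\{\theta_t\}_{t=1}^{T}$. Despite the underlying non-stationarity, we show that the estimation error of this estimator can gracefully adapt to the parameter changes.
Consider a sliding window of length $w$, and consider the observation history $\{(X_s, Y_s)\}_{s=1\vee(t-w)}^{t-1}$ of actions and rewards during the time window $(1\vee(t-w)):(t-1)$. The ridge regression problem with positive regularization parameter $\lambda$ is stated below:
$$\min_{\theta \in \mathbb{R}^d} \ \lambda \|\theta\|^2 + \sum_{s=1\vee(t-w)}^{t-1} \left(X_s^\top \theta - Y_s\right)^2. \qquad (2)$$
Denote $\hat{\theta}_t$ as a solution to the regularized ridge regression problem, and define the matrix $V_{t-1} = \lambda I + \sum_{s=1\vee(t-w)}^{t-1} X_s X_s^\top$. The solution $\hat{\theta}_t$ has the following explicit expression:
$$\hat{\theta}_t = V_{t-1}^{-1} \left(\sum_{s=1\vee(t-w)}^{t-1} X_s Y_s\right). \qquad (3)$$
The difference $\hat{\theta}_t - \theta_t$ has the following expression:
$$\hat{\theta}_t - \theta_t = V_{t-1}^{-1} \sum_{s=1\vee(t-w)}^{t-1} X_s X_s^\top (\theta_s - \theta_t) + V_{t-1}^{-1} \left(\sum_{s=1\vee(t-w)}^{t-1} \eta_s X_s - \lambda \theta_t\right). \qquad (4)$$
The first term on the right hand side of eq. (4) is the estimation inaccuracy due to the non-stationarity, while the second term is the estimation error due to random noise. We now upper bound the two terms separately. We upper bound the first term in the $\ell_2$ sense.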
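Lemma 1
For any $t \in [T]$, we have
$$\left\| V_{t-1}^{-1} \sum_{s=1\vee(t-w)}^{t-1} X_s X_s^\top (\theta_s - \theta_t) \right\| \le \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\|.$$
Proof Sketch. Our analysis relies on bounding the maximum eigenvalue of
$$V_{t-1}^{-1} \sum_{s=1\vee(t-w)}^{p} X_s X_s^\top$$
for each $p \in \{1\vee(t-w), \ldots, t-1\}$. Please refer to the appendix for the complete proof. Q.E.D.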
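By applying (Abbasi-Yadkori et al. 2011), we upper bound the second term in the matrix norm sense.
Lemma 2 ((Abbasi-Yadkori et al. 2011))
For any $t \in [T]$ and any $\delta \in [0,1]$, we have
$$\left\| \sum_{s=1\vee(t-w)}^{t-1} \eta_s X_s - \lambda \theta_t \right\|_{V_{t-1}^{-1}} \le R \sqrt{d \ln\left(\frac{1 + wL^2/\lambda}{\delta}\right)} + \sqrt{\lambda}\, S$$
holds with probability at least $1 - \delta$.
From now on, we shall denote
$$\beta = R \sqrt{d \ln\left(\frac{1 + wL^2/\lambda}{\delta}\right)} + \sqrt{\lambda}\, S \qquad (5)$$
for ease of presentation. With these two lemmas, we have the following deviation inequality type bound for the latent expected reward of any action $x \in D_t$ in any round $t$.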
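Theorem 2
For any $t \in [T]$ and any $\delta \in [0,1]$, we have with probability at least $1 - \delta$,
$$\left| x^\top (\theta_t - \hat{\theta}_t) \right| \le L \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\| + \beta \|x\|_{V_{t-1}^{-1}}$$
holds for all $x \in D_t$.
Proof. By eq. (4) and the Cauchy-Schwarz inequality, for any $x \in D_t$, the quantity $|x^\top(\hat{\theta}_t - \theta_t)|$ is at most $\|x\|$ times the Euclidean norm of the first term of eq. (4), plus $\|x\|_{V_{t-1}^{-1}}$ times the $V_{t-1}^{-1}$-norm of $\sum_{s=1\vee(t-w)}^{t-1} \eta_s X_s - \lambda\theta_t$. The claim then follows from $\|x\| \le L$ and Lemmas 1 and 2. Q.E.D.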
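The sliding window regularized least squares estimator can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions rather than the authors' implementation: the helper names `solve` and `sw_ridge`, the pure-Python Gaussian elimination, and the noise-free toy data (a parameter that jumps from (1, 0) to (0, 1) after 50 rounds) are all hypothetical. It contrasts a short window, which has forgotten the old parameter, with a window covering the entire history.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def sw_ridge(xs, ys, w, lam, d):
    """Ridge estimate from the last w (action, reward) pairs; requires w >= 1.

    Returns (theta_hat, V) with V = lam * I + sum of x x^T over the window.
    """
    V = [[lam * (i == j) for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in zip(xs[-w:], ys[-w:]):
        for i in range(d):
            b[i] += x[i] * y
            for j in range(d):
                V[i][j] += x[i] * x[j]
    return solve(V, b), V


# Toy data (hypothetical): alternate two unit actions; rewards are noise-free.
e = [(1.0, 0.0), (0.0, 1.0)]
xs = [e[t % 2] for t in range(60)]
ys = [x[0] for x in xs[:50]] + [x[1] for x in xs[50:]]  # theta jumps after round 50

theta_recent, _ = sw_ridge(xs, ys, w=10, lam=1.0, d=2)   # only post-change data
theta_all, _ = sw_ridge(xs, ys, w=60, lam=1.0, d=2)      # mixes both regimes
print(theta_recent, theta_all)
```

With the short window the estimate points towards the current parameter, while the full-history estimate is still dominated by the outdated regime, which is the "forgetting principle" in miniature.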
In this section, we describe the Sliding Window Upper Confidence Bound (SW-UCB) algorithm. When the variation budget $B_T$ is known, we show that the SW-UCB algorithm with a tuned window size achieves a dynamic regret bound which is optimal up to a multiplicative logarithmic factor. When the variation budget is unknown, we show that the SW-UCB algorithm can still be implemented with a suitably chosen window size so that the regret dependency on $T$ is optimal, which still results in first order optimality in this case (Keskin and Zeevi 2016).
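In the stochastic environment where the reward function is stationary, the well known UCB algorithm follows the principle of optimism in face of uncertainty (Auer et al. 2002b, Abbasi-Yadkori et al. 2011). Under this principle, the DM selects the action that maximizes the UCB, or the value of “mean plus confidence radius” (Auer et al. 2002b). We follow the principle by choosing in each round the action $X_t$ with the highest UCB, i.e.,
$$X_t = \arg\max_{x \in D_t} \left\{ \langle x, \hat{\theta}_t \rangle + L \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\| + \beta \|x\|_{V_{t-1}^{-1}} \right\} = \arg\max_{x \in D_t} \left\{ \langle x, \hat{\theta}_t \rangle + \beta \|x\|_{V_{t-1}^{-1}} \right\}. \qquad (6)$$
When the number of actions is moderate, the optimization problem (6) can be solved by an enumeration over all $x \in D_t$. Upon selecting $X_t$, we have
$$\langle x_t^*, \hat{\theta}_t \rangle + L \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\| + \beta \|x_t^*\|_{V_{t-1}^{-1}} \le \langle X_t, \hat{\theta}_t \rangle + L \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\| + \beta \|X_t\|_{V_{t-1}^{-1}} \qquad (7)$$
by virtue of UCB. From Theorem 2, we further have with probability at least $1 - \delta$,
$$\langle x_t^*, \theta_t \rangle \le \langle x_t^*, \hat{\theta}_t \rangle + L \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\| + \beta \|x_t^*\|_{V_{t-1}^{-1}} \qquad (8)$$
and
$$\langle X_t, \hat{\theta}_t \rangle \le \langle X_t, \theta_t \rangle + L \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\| + \beta \|X_t\|_{V_{t-1}^{-1}}. \qquad (9)$$
Combining inequalities (7), (8), and (9), we establish the following high probability upper bound for the expected per round regret, i.e., with probability at least $1 - \delta$,
$$\langle x_t^* - X_t, \theta_t \rangle \le 2L \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\| + 2\beta \|X_t\|_{V_{t-1}^{-1}}. \qquad (10)$$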
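The regret upper bound of the SW-UCB algorithm (to be formalized in Theorem 3) is thus
$$\mathcal{R}_T(\text{SW-UCB}) = \tilde{O}\left(w B_T + \frac{dT}{\sqrt{w}}\right). \qquad (11)$$
If $B_T$ is known, the DM can set $w = \Theta\left(d^{2/3} T^{2/3} B_T^{-2/3}\right)$ and achieve a regret upper bound $\tilde{O}\left(d^{2/3} B_T^{1/3} T^{2/3}\right)$. If $B_T$ is not known, which is often the case in practice, the DM can set $w = \Theta\left(d^{2/3} T^{2/3}\right)$ to obtain a regret upper bound $\tilde{O}\left(d^{2/3} (B_T + 1) T^{2/3}\right)$.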
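The choice of window length can be traced through a short balancing argument. The derivation below is a sketch that assumes the regret bound has the two-term form $w B_T + dT/\sqrt{w}$ up to logarithmic factors, with $w$ the window length, $d$ the dimension, $T$ the horizon and $B_T$ the variation budget; the symbols and the exact form are our reading of the discussion above, not a statement taken verbatim from it.

```latex
% Assumed form of the bound (up to logarithmic factors):
%   regret(w) \approx w B_T + d T w^{-1/2}
\begin{align*}
\text{Balance the two terms:}\quad
  w B_T = \frac{dT}{\sqrt{w}}
  \;&\Longleftrightarrow\; w^{3/2} = \frac{dT}{B_T}
  \;\Longleftrightarrow\; w = d^{2/3}\, T^{2/3}\, B_T^{-2/3}, \\
\text{Resulting order ($B_T$ known):}\quad
  w B_T &= d^{2/3}\, B_T^{1/3}\, T^{2/3}
  \quad\text{and}\quad
  \frac{dT}{\sqrt{w}} = d^{2/3}\, B_T^{1/3}\, T^{2/3}, \\
\text{With } w = d^{2/3} T^{2/3} \text{ ($B_T$ unknown):}\quad
  w B_T + \frac{dT}{\sqrt{w}} &= d^{2/3}\, T^{2/3}\, B_T + d^{2/3}\, T^{2/3}
  = d^{2/3}\, (B_T + 1)\, T^{2/3}.
\end{align*}
```

The first term grows with the window because a longer window mixes in more outdated parameters, while the second shrinks because more samples reduce the noise-driven error; the tuned window is where the two meet.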
In this section, we describe the details of the SW-UCB algorithm. Following its design guideline, the SW-UCB algorithm selects a positive regularization parameter $\lambda$ and initializes $V_0 = \lambda I$. In each round $t$, the SW-UCB algorithm first computes the estimate $\hat{\theta}_t$ for $\theta_t$ according to eq. (3), and then finds the action $X_t$ with the largest UCB by solving the optimization problem (6). Afterwards, the corresponding reward $Y_t$ is observed. The pseudo-code of the SW-UCB algorithm is shown in Algorithm 1.
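The round structure just described can be sketched in Python as follows. This is a hedged illustration, not the paper's Algorithm 1: the function names `sw_ucb_action` and `run_sw_ucb`, the Gaussian noise model, and treating the confidence radius `beta` as a user-supplied constant are all assumptions, and the action set is assumed to be a finite list so that the maximization can be done by enumeration.

```python
import math
import random


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def sw_ucb_action(history, D_t, w, lam, beta):
    """Return the action in D_t maximizing estimate + beta * confidence width.

    history is a list of (action, reward) pairs; only the last w are used (w >= 1).
    """
    d = len(D_t[0])
    V = [[lam * (i == j) for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in history[-w:]:
        for i in range(d):
            b[i] += x[i] * y
            for j in range(d):
                V[i][j] += x[i] * x[j]
    theta_hat = solve(V, b)

    def ucb(x):
        # width = sqrt(x^T V^{-1} x), computed by solving V z = x
        width = math.sqrt(max(0.0, dot(x, solve(V, list(x)))))
        return dot(x, theta_hat) + beta * width

    return max(D_t, key=ucb)


def run_sw_ucb(decision_sets, thetas, w, lam, beta, noise_sd=0.1, seed=0):
    """Simulate SW-UCB on a given instance and return its realized dynamic regret."""
    rng = random.Random(seed)
    history, regret = [], 0.0
    for D_t, theta_t in zip(decision_sets, thetas):
        x = sw_ucb_action(history, D_t, w, lam, beta)
        y = dot(x, theta_t) + rng.gauss(0.0, noise_sd)  # assumed Gaussian noise
        history.append((x, y))
        regret += max(dot(a, theta_t) for a in D_t) - dot(x, theta_t)
    return regret
```

A call such as `run_sw_ucb([D] * 40, thetas, w=10, lam=1.0, beta=0.5)` with a list `D` of action tuples runs the loop end to end. Rebuilding the matrix from the window in every round keeps the sketch short; an implementation concerned with speed would update it incrementally.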
We are now ready to formally state a regret upper bound of the SW-UCB algorithm.
Theorem 3
The dynamic regret of the SW-UCB algorithm is upper bounded as
$$\mathcal{R}_T(\text{SW-UCB}) = \tilde{O}\left(w B_T + \frac{dT}{\sqrt{w}}\right).$$
When $B_T$ is known, by taking $w = \Theta\left(d^{2/3} T^{2/3} B_T^{-2/3}\right)$, the dynamic regret of the SW-UCB algorithm is $\tilde{O}\left(d^{2/3} B_T^{1/3} T^{2/3}\right)$.
When $B_T$ is unknown, by taking $w = \Theta\left(d^{2/3} T^{2/3}\right)$, the dynamic regret of the SW-UCB algorithm is $\tilde{O}\left(d^{2/3} (B_T + 1) T^{2/3}\right)$.
Proof Sketch. The proof utilizes the fact that the per round regret of the SW-UCB algorithm is upper bounded by the UCB of the chosen action, and decomposes the UCB into two separate terms according to Lemmas 1 and 2, i.e., the regret in round $t$ is at most
$$2L \sum_{s=1\vee(t-w)}^{t-1} \|\theta_s - \theta_{s+1}\| + 2\beta \|X_t\|_{V_{t-1}^{-1}}.$$
The first term can be upper bounded by an intuitive telescoping sum; while for the second term, although a similar quantity is analyzed by the authors of (Abbasi-Yadkori et al. 2011) using a (beautiful) matrix telescoping technique under the stationary environment, we note that due to the “forgetting principle” of the SW-UCB algorithm, we cannot directly adopt the technique. Our proof thus makes a novel use of the Sherman-Morrison formula to overcome the barrier. Please refer to the appendix for the complete proof. Q.E.D.
Remark 1
When the variation budget is known, Theorem 3 recommends choosing the length $w$ of the sliding window to be decreasing with $B_T$. The recommendation is in agreement with the intuition that, when the learning environment becomes more volatile, the DM should focus on more recent observations. Indeed, if the underlying learning environment is changing at a higher rate, then the DM’s past observations become obsolete faster. Theorem 3 pins down the intuition of forgetting past observations in face of drifting environments, by providing the mathematical definition of the sliding window length that yields the optimal regret bound.
In the previous section, we have seen that, by properly tuning $w$, the DM can achieve a first order optimal regret bound even if the knowledge of $B_T$ is not available. However, in the case of an unknown and large $B_T$, i.e., $B_T = \Omega(T^{1/3})$, the bound becomes meaningless as it is linear in $T$. To handle this case, we wish to design an online algorithm whose dynamic regret has a better dependence on $B_T$, without knowing $B_T$. Note that, by Theorem 1, no algorithm can achieve a dynamic regret of smaller order than the lower bound. In this section, we develop a novel Bandit-over-Bandit (BOB) algorithm for this purpose. The BOB algorithm still has a dynamic regret sublinear in $T$ when $B_T = \Theta(T^{\rho})$ for any $\rho \in (0,1)$ and $B_T$ is not known, unlike the SW-UCB algorithm.
Reviewing Theorem 3, we know that setting the window length $w$ to a fixed value
$$w^* = \left\lfloor (dT)^{2/3} B_T^{-2/3} \right\rfloor \qquad (12)$$
can give us a $\tilde{O}\left(d^{2/3} B_T^{1/3} T^{2/3}\right)$ regret bound. But when $B_T$ is not provided a priori, we need to also “learn” the unknown $B_T$ in order to properly tune $w$. In a more restrictive setting in which the differences between consecutive $\theta_t$’s follow some underlying stochastic process, one possible approach is applying a suitable machine learning technique to learn the underlying stochastic process at the beginning, and tune the parameter $w$ accordingly. In the more general setting, however, this strategy cannot work as the change between consecutive $\theta_t$’s can be arbitrary (or even adversarial) as long as the total variation is bounded by $B_T$.
The above mentioned observations as well as the established results motivate us to make use of the SW-UCB algorithm as a sub-routine, and “hedge” (Auer et al. 2002a, Audibert and Bubeck 2009) against the (possibly adversarial) changes of $\theta_t$’s to identify a reasonable fixed window length.
To this end, we describe the main idea of the Bandit-over-Bandit (BOB) algorithm. As illustrated in Fig. 1, the BOB algorithm divides the whole time horizon into $\lceil T/H \rceil$ blocks of equal length $H$ rounds (the last block can possibly have fewer than $H$ rounds), and specifies a set $J$ from which each window length $w_i$ is drawn. For each block $i \in [\lceil T/H \rceil]$, the BOB algorithm first selects a window length $w_i \in J$, and restarts the SW-UCB algorithm with the selected window length as a sub-routine to choose actions for this block. On top of this, the BOB algorithm also maintains a separate algorithm for adversarial multi-armed bandits, e.g., the EXP3 algorithm against an oblivious adversary (Auer et al. 2002a, Audibert and Bubeck 2009), to govern the selection of the window length for each block, and thus the name Bandit-over-Bandit. Here, the total reward of each block is used as feedback for the EXP3 algorithm. It is worth emphasizing that

For validity of the EXP3 algorithm, the SW-UCB algorithm for each block does not use any data collected in previous blocks except for the choice of window length (please see Remark 2 for a more involved discussion).

Due to the design of restarting, any instance of the SW-UCB algorithm cannot last for more than $H$ rounds. As a consequence, even if the EXP3 algorithm selects a window length $w_i$ larger than $H$ for some block $i$, the effective window length is $H$. It is thus reasonable to enforce that $J$ is a subset of $[H]$, i.e., $J \subseteq [H]$.
To determine $H$ and $J$, we first consider the regret of the BOB algorithm. As mentioned above, since $w^*$ is not necessarily attainable, i.e., by definition in eq. (12), $w^*$ might be larger than $H$ when $B_T$ is small, we hence denote the optimally (over $J$) tuned window length as $w^{\dagger}$. By design of the BOB algorithm, its regret can be decomposed as the regret of the SW-UCB algorithm with the optimally tuned window length $w_i = w^{\dagger}$ for each block $i$ plus the loss due to learning the value $w^{\dagger}$ with the EXP3 algorithm, i.e.,
(13) 
Here, eq. (13) holds as the BOB algorithm restarts the SW-UCB algorithm in each block, and for a round $t$ in block $i$, $X_t(w)$ refers to the action selected in round $t$ by the SW-UCB algorithm with window length $w$ initiated at the beginning of block $i$.
By Theorem 3, the first expectation in eq. (13) can be upper bounded as
(14) 
where
is the total variation in block
We then turn to the second expectation in eq. (13). We can easily see that the number of rounds for the EXP3 algorithm is $\lceil T/H \rceil$ and the number of possible values of $w_i$’s is $|J|$. Denoting the maximum absolute sum of rewards of any block as the random variable $Q$, the authors of (Auer et al. 2002a) give the following regret bound.
(15) 
To proceed, we have to give a high probability upper bound for $Q$.
Lemma 3
With probability at least does not exceed i.e.,
Proof.
Proof Sketch. The proof makes use of the sub-Gaussian property of the noise terms as well as the union bound. Please refer to the appendix for the complete proof. Q.E.D.
Note that the regret of our problem is at most eq. (15) can be further upper bounded as
(16) 
Combining eq. (13), (14), and (16), the regret of the BOB algorithm is
(17) 
Eq. (17) exhibits a similar structure to the regret of the SW-UCB algorithm as stated in Theorem 3, and this immediately indicates a clear trade-off in the design of the block length $H$:

On one hand, $H$ should be small to control the regret incurred by the EXP3 algorithm in identifying $w^{\dagger}$, i.e., the third term in eq. (17).

On the other hand, $H$ should also be large enough to allow $w^{\dagger}$ to get close to $w^*$ so that the sum of the first two terms in eq. (17) is minimized.
A more careful inspection also reveals the tension in the design of $J$. Obviously, we hope that $|J|$ is small to minimize the third term in eq. (17), but we also wish $J$ to be dense enough so that it forms a cover of the set $[H]$. Otherwise, even if $H$ is large enough that $w^{\dagger}$ can approach $w^*$, approximating $w^*$ with any element in $J$ can cause a major loss.
These observations suggest the following choice of $J$:
(18) 
for some positive integer $\Delta$, and since the choice of $H$ should not depend on $B_T$, we can set $H$ as a function of $d$ and $T$ only, with the exponents to be determined. We then distinguish two cases depending on whether $w^*$ is smaller than $H$ or not (or alternatively, whether $B_T$ is larger than the corresponding threshold or not).
Under this situation, $w^{\dagger}$ can automatically adapt to the nearly optimal window length, namely the largest element in $J$ that does not exceed $w^*$. The regret of the BOB algorithm then becomes
(19) 
Under this situation, $w^{\dagger}$ equals $H$, which is the window length closest to $w^*$; the regret of the BOB algorithm then becomes
(20) 
where we have made use of the fact that
Now both eq. (19) and eq. (20) suggest how the parameter $\Delta$ should be set, and eq. (20) further reveals how the block length $H$ should scale with $d$ and $T$. Plugging these choices of parameters back into eq. (19), we have that in case 1, the dynamic regret of the BOB algorithm is upper bounded as
(21) 
while in case 2, the dynamic regret of the BOB algorithm is upper bounded as
(22) 
according to eq. (20). Here we have to emphasize that the choice of and are purely for the purpose of analysis, while the only parameters that we need to decide are
(23) 
which clearly do not depend on $B_T$.
We are now ready to describe the details of the BOB algorithm. With $H$, $\Delta$, and $J$ defined as in eq. (23), the BOB algorithm additionally initiates the parameter
(24) 
for the EXP3 algorithm (Auer et al. 2002a). The BOB algorithm then divides the time horizon into $\lceil T/H \rceil$ blocks of length $H$ rounds (except for the last block, which can have fewer than $H$ rounds). At the beginning of each block $i$, the BOB algorithm first sets
(25) 
and then samples the index of a window length in $J$ according to these probabilities. The selected window length $w_i$ is thus the corresponding element of $J$. Afterwards, the BOB algorithm selects actions by running the SW-UCB algorithm with window length $w_i$ for each round in block $i$, and records the total collected reward of the block.
Finally, the total reward is rescaled and shifted so that it lies within $[0,1]$ with high probability, and the EXP3 parameter of the selected window length is set to
(26) 
while the parameters of all other window lengths remain unchanged. The pseudo-code of the BOB algorithm is shown in Algorithm 2.
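The block-and-restart structure can be sketched in Python as below. This is an illustration under stated assumptions, not the paper's Algorithm 2: the geometric grid in `window_grid`, the EXP3 exploration rate, the `reward_bound` used for rescaling, and the callback `run_block(window, start, end)` (assumed to run a freshly initialized SW-UCB instance on rounds `start` to `end - 1` and return the block's total reward) are all hypothetical choices. The per-block weight normalization is added only for numerical stability and does not change the sampling probabilities.

```python
import math
import random


def window_grid(H):
    """Geometric grid of candidate window lengths in [1, H] (an assumed choice of J)."""
    delta = max(1, math.ceil(math.log(H)))
    return sorted({max(1, math.floor(H ** (j / delta))) for j in range(delta + 1)})


def bob(T, H, run_block, reward_bound, seed=0):
    """EXP3 over window lengths, one decision per block of H rounds.

    reward_bound is an assumed high-probability bound on |block reward|.
    Returns the list of window lengths chosen for the successive blocks.
    """
    rng = random.Random(seed)
    J = window_grid(H)
    K = len(J)
    n_blocks = math.ceil(T / H)
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * n_blocks)))
    s = [1.0] * K  # EXP3 weights, one per candidate window length
    picks = []
    for i in range(n_blocks):
        total = sum(s)
        p = [(1 - gamma) * sj / total + gamma / K for sj in s]
        j = rng.choices(range(K), weights=p)[0]
        start, end = i * H, min((i + 1) * H, T)
        reward = run_block(J[j], start, end)  # fresh sub-routine: no data from earlier blocks
        # Rescale the block reward into [0, 1] before feeding it to EXP3.
        scaled = min(1.0, max(0.0, 0.5 + reward / (2.0 * reward_bound)))
        s[j] *= math.exp(gamma * scaled / (K * p[j]))
        top = max(s)
        s = [sj / top for sj in s]  # normalization for numerical stability only
        picks.append(J[j])
    return picks
```

Passing the sub-routine as a callback keeps the sketch independent of any particular SW-UCB implementation; the only requirement, in line with the discussion above, is that each call starts from scratch so that EXP3 faces an oblivious environment.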
We are now ready to present the regret analysis of the BOB algorithm.
Theorem 4
The dynamic regret of the BOB algorithm is upper bounded as follows.

When

When
Proof Sketch. The proof of the theorem essentially follows the derivation above. Please refer to the appendix for the complete proof. Q.E.D.
Remark 2
The block structure and restarting the SW-UCB algorithm with a single window length for each block are essential for the correctness of the BOB algorithm. Otherwise, suppose the DM utilizes the EXP3 algorithm to select the window length for each round and implements the SW-UCB algorithm with the selected window length without ever restarting it. Instead of eq. (13), the regret of the BOB algorithm is then decomposed as
(27) 
Here, with some abuse of notation, the actions in eq. (27) refer to those selected when, in a given round, the DM runs the SW-UCB algorithm with a given window length and with historical data, e.g., (action, reward) pairs, generated by running the SW-UCB algorithm with the window lengths selected in the previous rounds. Same as before, the second term of eq. (27) can be upper bounded as a result of Theorem 3. It is also tempting to apply results from the EXP3 algorithm to upper bound the first term. Unfortunately, this is incorrect, as it is required by the adversarial bandits protocol (Auer et al. 2002a) that the DM and its competitor should receive the same reward if they select the same action, i.e., the reward obtained with a given window length in a given round should be the same regardless of the window lengths used in the previous rounds. Nevertheless, this is violated, as running the SW-UCB algorithm with different window lengths for previous rounds can generate different (action, reward) pairs, and this results in possibly different estimates of the $\theta_t$’s for the two SW-UCB algorithms even if both of them use the same window length in the current round. Hence, the selected actions and the corresponding rewards of these two instances might also be different. By the careful design of blocks as well as the restarting scheme, the BOB algorithm decouples the SW-UCB algorithm for a block from previous blocks, and thus fixes the above mentioned problem, i.e., the regret of the BOB algorithm is decomposed as in eq. (13).
Remark 3
The assumptions that the decision sets $D_t$’s and the underlying parameters $\theta_t$’s are chosen by an oblivious adversary, as well as the i.i.d. noise terms, are important. With these, the EXP3 algorithm is also facing an obliviously adversarial environment, and thus satisfies the adversarial bandits protocol (Auer et al. 2002a, Audibert and Bubeck 2009). To see this, assuming additional access to all the i.i.d. (and thus independent of the DM’s actions) noise terms in advance, the adversary can determine the total rewards of each block with respect to each window length in $J$, independently of the EXP3 algorithm, by running the SW-UCB algorithm with that window length as well as the obliviously chosen $D_t$’s and $\theta_t$’s.
Remark 4
The structure of the BOB algorithm can be roughly seen as using an EXP3 algorithm to govern the behaviors of many copies of SW-UCB algorithms with different window lengths. As mentioned in the review of related works, it has a flavor similar to the technique of bandits corralling/aggregation, see e.g., (Agarwal et al. 2017, Besbes et al. 2018) and references therein. Existing works, such as (Luo et al. 2018, Besbes et al. 2018), have tried to apply bandits corralling/aggregation to the non-stationary bandits settings, but when $B_T$ is unknown, they can only obtain either sub-optimal dynamic regret bounds (Luo et al. 2018) or empirical performance guarantees (Besbes et al. 2018).
In this section, we demonstrate the generality of our established results. As illustrative examples, we apply the sliding window UCB algorithm as well as the bandit-over-bandit framework to the prevalent multi-armed bandit (Auer et al. 2002b) and generalized linear bandit settings. In what follows, we shall derive the SW-UCB algorithm as well as the parameters required by the BOB algorithm, i.e., those similar to the ones defined in eq. (23), for both cases. We note that the same steps can be applied to other settings, such as Gaussian process bandits (Srinivas et al. 2010) and combinatorial bandits with semi-bandit feedback (Kveton et al. 2015), to derive similar results.
In the $d$-armed bandit setting, every action set $D_t$ is comprised of $d$ actions $e_1, \ldots, e_d$. The $i$-th action $e_i$ has its $i$-th coordinate equal to 1 and all other coordinates equal to 0. Therefore, the reward of choosing action $X_t = e_i$ in round $t$ is
$$Y_t = \theta_t(i) + \eta_t,$$
where $\theta_t(i)$ is the $i$-th coordinate of $\theta_t$. For a window length $w$, we also define $N_t(i)$ as the number of times that action $e_i$ is chosen within rounds $(t-w)\vee 1, \ldots, t-1$.
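For one-hot actions, the windowed design matrix is diagonal, so the sliding window ridge estimate reduces to a per-arm regularized average over the window. The Python sketch below illustrates this; the function name `sw_arm_stats`, the integer encoding of arms as `0, ..., d - 1`, and the toy data are assumptions for illustration.

```python
def sw_arm_stats(pulls, rewards, d, w, lam):
    """Per-arm pull counts and ridge estimates over the last w rounds (w >= 1).

    pulls[s] is the index (0..d-1) of the arm chosen in round s, and rewards[s]
    is the observed reward. With one-hot actions, the matrix lam * I + sum x x^T
    is diagonal with entries lam + count, so each coordinate of the ridge
    estimate is (sum of that arm's rewards in the window) / (lam + count).
    """
    counts = [0] * d
    sums = [0.0] * d
    for arm, y in zip(pulls[-w:], rewards[-w:]):
        counts[arm] += 1
        sums[arm] += y
    estimates = [s / (lam + n) for s, n in zip(sums, counts)]
    return counts, estimates


# Toy usage (hypothetical data): only the last 3 of 5 rounds enter the window.
counts, estimates = sw_arm_stats(
    pulls=[0, 1, 0, 1, 1], rewards=[1.0, 0.0, 1.0, 1.0, 1.0], d=2, w=3, lam=1.0
)
print(counts, estimates)
```

An arm that has not been pulled within the window has count zero and estimate zero, so its confidence width is the largest possible, which is what drives renewed exploration of arms whose data has been forgotten.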