Hedging the Drift: Learning to Optimize under Non-Stationarity
Wang Chi Cheung
Department of Industrial Systems Engineering and Management, National University of Singapore firstname.lastname@example.org
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, email@example.com
Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, MA 02139, firstname.lastname@example.org
We introduce general data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for non-stationary bandit settings. These settings capture applications such as advertisement allocation and dynamic pricing in changing environments. We show how the difficulty posed by the non-stationarity (unknown a priori and possibly adversarial) can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Our main contribution is a general algorithmic recipe that first converts the rate-optimal Upper-Confidence-Bound (UCB) algorithm for stationary bandit settings into a tuned Sliding Window-UCB algorithm with optimal dynamic regret for the corresponding non-stationary counterpart. Boosted by the novel bandit-over-bandit framework, which adapts automatically to the unknown changing environment, the recipe even attains, in a (surprisingly) parameter-free manner, this optimal dynamic regret if the amount of non-stationarity is moderate to large, and an improved (compared to existing literature) dynamic regret otherwise. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the “forgetting principle” in their online learning processes, which is vital in changing environments. We further conduct extensive numerical experiments on both synthetic data and the CPRM-12-001: On-Line Auto Lending dataset provided by the Center for Pricing and Revenue Management at Columbia University to show that our proposed algorithms achieve superior dynamic regret performance.
Key words: data-driven decision-making, non-stationary bandit optimization, parameter-free algorithm
Consider the following general decision-making framework: a decision-maker (DM) interacts with an environment by picking actions one at a time sequentially. Upon selecting an action, it instantly receives a reward drawn randomly from a probability distribution tied to this action. The goal of the DM is to maximize its cumulative rewards. However, it also faces the following challenges:
Uncertainty: the reward distribution of each action is initially unknown to the DM, and it thus has to estimate the underlying distributions via interacting with the environment.
Non-Stationarity: moreover, the environment is non-stationary, and the reward distributions can evolve over time.
Partial/Bandit Feedback: finally, the DM can only observe the random reward of the selected action each time; while the rewards of the un-chosen actions remain unknown.
Evidently, the DM faces a trilemma among exploration, exploitation, and adaptation to changes. On one hand, the DM wishes to exploit, and to select the action with the best historical performance to earn as much reward as possible; on the other hand, it also wants to explore other actions to get a more accurate estimate of the reward distributions. The changing environment makes the exploration-exploitation trade-off even more delicate. Indeed, past observations could become obsolete due to the changes in the environment, and the DM needs to explore for changes and refrain from exploiting possibly outdated observations. It turns out that many applications fall naturally into this general framework:
An online platform allocates advertisements (ads) to a sequence of users. For each arriving user, the platform has to deliver an ad to her, and only observes the user’s response to the displayed ad. The platform has full access to the features of the ads and the users. Following (Agrawal and Goyal 2013), we assume that a user’s click behavior towards an ad, or simply the click through rate (CTR) of this ad by a particular user, follows a probability distribution governed by a common, but initially unknown, response function that is linear in the features of the ad and the user. The platform’s goal is to maximize the total profit. However, the unknown response function can change over time. For instance, around the time that Apple releases a new iPhone, one can expect the popularity of an Apple ad to grow.
A seller decides the (personalized) price dynamically (Keskin and Zeevi 2014, 2016, Besbes and Zeevi 2015, Ban and Keskin 2018) for each of the incoming customers, hoping to maximize sales profit. Beginning with an unknown demand function, the DM only observes the purchase decision of a customer under the posted price, but not under any other price. In addition, the customers’ reaction towards the same price can vary across time due to product reviews, the emergence of competitive products, etc.
Our framework is closely related to the multi-armed bandit (MAB) problems. MAB problems are online problems with partial feedback, in which the DM is subject to uncertainty in its learning environment. Traditionally, most MAB problems are studied in the stochastic (Auer et al. 2002b) and adversarial (Auer et al. 2002a) environments. In the former, the model uncertainty is static and the partial feedback is corrupted by a mean zero random noise. The DM aims at estimating the latent static environment using historical data, and converging to a static optimal decision. In the latter, the model is dynamically changed by an adversary. The DM strives to hedge against the changes, and compete favorably in comparison to certain benchmark policies.
Unfortunately, strategies for the stochastic environments can quickly deteriorate under non-stationarity as historical data might “expire”, while allowing a confronting adversary, as in the adversarial settings, could be too pessimistic. Recently, a stream of research works (see Related Works) focuses on MAB problems in a drifting environment, which is a hybrid of a stochastic and an adversarial environment. Although the environment can be dynamically and adversarially changed, the total change (quantified by a suitable metric) over the time horizon is upper bounded by a quantity called the variation budget. The feedback is corrupted by a mean zero random noise. The aim is to minimize the dynamic regret, which is the optimality gap compared to the sequence of (possibly dynamically changing) optimal decisions, by simultaneously estimating the current environment and hedging against future changes at every time step. Most of the existing works on non-stationary bandits have focused on the somewhat ideal case in which the variation budget is known. In practice, however, the variation budget is often not available ahead of time. Though some efforts have been made in this direction (Karnin and Anava 2016, Luo et al. 2018), how to design algorithms with low dynamic regret when the variation budget is unknown remains a challenging problem.
In this paper, we design and analyze novel algorithms for bandit problems in a drifting environment. Our main contributions are listed as follows.
When the variation budget is known, we characterize the lower bound of dynamic regret, and develop a tuned Sliding Window Upper-Confidence-Bound (SW-UCB) algorithm with matched dynamic regret upper bound up to logarithmic factors.
When the variation budget is unknown, we propose a novel Bandit-over-Bandit (BOB) framework that tunes SW-UCB adaptively. When the amount of non-stationarity is moderate to large, the application of BOB on the SW-UCB algorithm recovers the optimal dynamic regret; otherwise, it obtains a dynamic regret bound that improves on existing literature.
Our results are general in the sense that given any Upper-Confidence-Bound (UCB) algorithm with optimal regret for the stationary bandit settings, we are able to convert it into the corresponding SW-UCB algorithm with optimal dynamic regret (when the variation budget is known) and the BOB algorithm with nearly optimal dynamic regret (when the variation budget is unknown), respectively.
Our algorithm design and analysis shed light on the fine balance between exploration, exploitation and adaptation to changes in dynamic learning environments. We rigorously incorporate the “forgetting principle” into the Optimism-in-Face-of-Uncertainty principle, by demonstrating that the DM should dispose of observations which are sufficiently old. We provide precise criteria for the disposal, and rigorously show the convergence to optimality under these criteria.
Finally, we point out that a preliminary version of this paper will appear in the International Conference on Artificial Intelligence and Statistics (AISTATS 2019) (Cheung et al. 2019), and the current paper is a significantly extended version of it. Specifically, the current version provides a substantially refined analysis and thus a greatly improved regret bound for the BOB algorithm when compared to (Cheung et al. 2019). More importantly, we demonstrate the generality of the established results in the current paper, as exemplified by the extensions to the prevalent multi-armed bandit and generalized linear bandit settings.
The rest of the paper is organized as follows. We first review existing works in stationary and non-stationary environments. We then formulate the non-stationary linear bandit model. We note that the choice of the linear model is purely for the purpose of illustration, and all the results derived are applicable to other settings (please see Section 4). Next, we establish a minimax lower bound for our problem, and describe the sliding window estimator for parameter estimation under non-stationarity. We then develop the sliding window-upper confidence bound algorithm with optimal dynamic regret (when the amount of non-stationarity is known ahead), and introduce the novel bandit-over-bandit framework with nearly optimal regret. In Section 4, we demonstrate the generality of the established results by applying them to other popular bandit settings, such as the multi-armed bandit and generalized linear bandit settings. In Section 10, we conduct extensive numerical experiments with both synthetic and CPRM-12-001: On-Line Auto Lending datasets to show the superior empirical performance of our algorithms. Finally, we conclude our paper.
MAB problems with stochastic and adversarial environments are extensively studied, as surveyed in (Bubeck and Cesa-Bianchi 2012, Lattimore and Szepesvári 2018). To model inter-dependence relationships among different arms, models for linear bandits in stochastic environments have been studied. In (Auer 2002, Dani et al. 2008, Rusmevichientong and Tsitsiklis 2010, Chu et al. 2011, Abbasi-Yadkori et al. 2011), UCB type algorithms for stochastic linear bandits were studied, and (Abbasi-Yadkori et al. 2011) gave the state-of-the-art algorithm for the problem. Thompson sampling (TS) algorithms proposed in (Russo and Roy 2014, Agrawal and Goyal 2013, Abeille and Lazaric 2017) are able to bypass the high computational complexities provided that one can efficiently sample from the posterior on the parameters and optimize the reward function accordingly. Unfortunately, achieving an optimal regret bound via TS algorithms is possible only if the true prior over the reward vector is known.
(Besbes et al. 2014, 2018) considered the -armed bandit in a drifting environment. They achieved the tight dynamic regret bound when is known. Wei et al. (Wei et al. 2016) provided refined regret bounds based on empirical variance estimation, assuming the knowledge of . Subsequently, Karnin et al. (Karnin and Anava 2016) considered the setting without knowing and , and achieved a dynamic regret bound of . In a recent work, (Luo et al. 2018) considered -armed contextual bandits in drifting environments, and in particular demonstrated an improved bound for the -armed bandit problem in drifting environments when is not known, among other results. (Keskin and Zeevi 2016) considered a dynamic pricing problem in a drifting environment with linear demands. Assuming a known variation budget they proved an dynamic regret lower bound and proposed a matching algorithm. When is not known, they designed an algorithm with dynamic regret. In (Besbes et al. 2015), a general problem of stochastic optimization under the known budgeted variation environment was studied. The authors presented various upper and lower bound in the full feedback settings. Finally, various online problems with full information feedback and drifting environments are studied in the literature (Chiang et al. 2012, Jadbabaie et al. 2015).
Apart from the drifting environment, numerous research works consider the switching environment, where the time horizon is partitioned into a bounded number of intervals, and the environment switches from one stochastic environment to another across different intervals. The partition is not known to the DM. Algorithms are designed for various bandits, assuming that the bound on the number of intervals is known (Auer et al. 2002a, Garivier and Moulines 2011, Luo et al. 2018), or unknown (Karnin and Anava 2016, Luo et al. 2018). Notably, the Sliding Window-UCB and the “forgetting principle” were first proposed by (Garivier and Moulines 2011), although they were analyzed only under multi-armed switching environments.
Finally, it is worth pointing out that our Bandit-over-Bandit framework has connections with algorithms for online model selection and bandit corralling, see e.g., (Agarwal et al. 2017, Besbes et al. 2018) and references therein. This and similar techniques have been investigated under the context of non-stationary bandits in (Luo et al. 2018, Besbes et al. 2018). Notwithstanding, existing works either obtain sub-optimal dynamic regret bounds or only empirical performance guarantees.
In this section, we introduce the notations that will be used throughout the discussions and the model formulation.
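Throughout the paper, all vectors are column vectors, unless specified otherwise. We define $[n]$ to be the set $\{1, 2, \ldots, n\}$ for any positive integer $n$. The notation $s:t$ is the abbreviation of consecutive indexes $s, s+1, \ldots, t$. We use $\|x\|$ to denote the Euclidean norm of a vector $x \in \mathbb{R}^d$. For a positive definite matrix $A \in \mathbb{R}^{d \times d}$, we use $\|x\|_A = \sqrt{x^\top A x}$ to denote the matrix norm of a vector $x \in \mathbb{R}^d$. We also denote $x \vee y$ and $x \wedge y$ as the maximum and minimum between $x, y \in \mathbb{R}$, respectively. When logarithmic factors are omitted, we use $\widetilde{O}(\cdot)$ to denote function growth.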
In each round , a decision set is presented to the DM, and it has to choose an action Afterwards, the reward
is revealed. Here, we allow to be chosen by an oblivious adversary whose actions are independent of those of the DM, and can be determined before the protocol starts (Cesa-Bianchi and Lugosi 2006). The vector of parameter is an unknown -dimensional vector, and is a random noise drawn i.i.d. from an unknown sub-Gaussian distribution with variance proxy . This implies , and we have
Following the convention of existing bandits literature (Abbasi-Yadkori et al. 2011, Agrawal and Goyal 2013), we assume there exist positive constants and such that and holds for all and all and the problem instance is normalized so that for all and
Instead of assuming the stochastic environment, where the reward function remains stationary across the time horizon, we allow it to change over time. Specifically, we consider the general drifting environment: the sum of differences of consecutive ’s should be bounded by some variation budget , i.e.,
We again allow the ’s to be chosen adversarially by an oblivious adversary. We also denote the set of all possible obliviously selected sequences of ’s that satisfies inequality (1) as
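To make the variation budget concrete, the following minimal Python sketch computes the total variation of a given parameter sequence and checks it against a budget. The function names are hypothetical, and measuring consecutive differences in the Euclidean norm is an assumption made for illustration.

```python
import math

def total_variation(thetas):
    """Sum of distances between consecutive parameter vectors.

    Assumption: differences are measured in the Euclidean norm.
    """
    return sum(math.dist(a, b) for a, b in zip(thetas, thetas[1:]))

def within_budget(thetas, budget):
    """True if the sequence respects the given variation budget."""
    return total_variation(thetas) <= budget

# A two-dimensional parameter that drifts once, stays put, then jumps.
thetas = [(0.0, 0.0), (0.3, 0.4), (0.3, 0.4), (0.3, 1.4)]
print(total_variation(thetas))      # roughly 0.5 + 0.0 + 1.0
print(within_budget(thetas, 2.0))
```

Note that the budget constrains only the total amount of change, not when it happens: a single large jump and many small drifts can consume the same budget.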
The DM’s goal is to design a policy to maximize the cumulative reward, or equivalently to minimize the worst case cumulative regret against the optimal policy , that has full knowledge of ’s. Denoting the dynamic regret of a given policy is defined as
where the expectation is taken with respect to the (possible) randomness of the policy.
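As a rough illustration of how dynamic regret differs from regret against a single fixed action, the sketch below evaluates the noise-free regret of one realized action sequence under linear rewards. The function name and the toy instance are hypothetical; the expectation over the policy's randomness and the worst case over parameter sequences are deliberately left out.

```python
def dynamic_regret(thetas, decision_sets, chosen):
    """Noise-free dynamic regret of a realized action sequence.

    In each round the benchmark is the best action for that round's
    parameter, so the benchmark may change from round to round.
    """
    def inner(x, theta):
        return sum(xi * ti for xi, ti in zip(x, theta))

    regret = 0.0
    for theta, actions, x in zip(thetas, decision_sets, chosen):
        best = max(inner(a, theta) for a in actions)
        regret += best - inner(x, theta)
    return regret

# Two arms encoded as unit vectors; the better arm switches in round 3.
arms = [(1.0, 0.0), (0.0, 1.0)]
thetas = [(0.9, 0.1), (0.9, 0.1), (0.1, 0.9), (0.1, 0.9)]
always_first = [arms[0]] * 4
print(dynamic_regret(thetas, [arms] * 4, always_first))
```

In this toy instance, always playing the first arm is optimal for the first two rounds and loses the gap between the two arms in each of the last two rounds.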
We first provide a lower bound on the regret to characterize the best achievable regret.
For any the dynamic regret of any policy satisfies
Proof Sketch. The construction of the lower bound instance is similar to the approach of (Besbes et al. 2014): nature divides the whole time horizon into blocks of equal length rounds (the last block can possibly have fewer than rounds). In each block, nature initiates a new stationary linear bandits instance with parameters from the set Nature also chooses the parameter for a block in a way that depends only on the DM’s policy, and the worst case regret is Since there is at least number of blocks, the total regret is By examining the variation budget constraint, we have that the smallest possible one can take is The statement then follows. Please refer to the appendix for the complete proof. Q.E.D.
As a preliminary, we introduce the sliding window regularized least squares estimator, which is the key tool in estimating the unknown parameters . Despite the underlying non-stationarity, we show that the estimation error of this estimator can gracefully adapt to the parameter changes.
Consider a sliding window of length and consider the observation history during the time window . The ridge regression problem with regularization parameter is stated below:
Denote as a solution to the regularized ridge regression problem, and define matrix . The solution has the following explicit expression:
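As a hedged illustration of the sliding window estimator, the sketch below solves the ridge regression problem restricted to the most recent observations by forming the regularized Gram matrix of the actions in the window and solving the resulting linear system. The function names and the toy data are hypothetical, and the code uses plain Python lists with a small Gaussian elimination routine so that it has no external dependencies.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def sw_ridge(actions, rewards, window, lam):
    """Ridge estimate from the last `window` (action, reward) pairs only."""
    if window < 1 or lam <= 0:
        raise ValueError("window must be >= 1 and lam must be positive")
    d = len(actions[0])
    recent = list(zip(actions, rewards))[-window:]
    # V = lam * I + sum of outer products x x^T over the window.
    V = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in recent:
        for i in range(d):
            b[i] += x[i] * y
            for j in range(d):
                V[i][j] += x[i] * x[j]
    return solve(V, b)

# Noise-free toy data: the parameter switches from (1, 0) to (0, 1).
actions = [(1.0, 0.0), (0.0, 1.0)] * 5
true_thetas = [(1.0, 0.0)] * 6 + [(0.0, 1.0)] * 4
rewards = [x[0] * th[0] + x[1] * th[1] for x, th in zip(actions, true_thetas)]
print(sw_ridge(actions, rewards, window=4, lam=0.1))   # tracks the new parameter
print(sw_ridge(actions, rewards, window=10, lam=0.1))  # still mixes in stale data
```

The comparison between the two window lengths shows the forgetting effect: the short window discards the observations generated under the old parameter, while the long window averages over both regimes.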
The difference has the following expression:
The first term on the right hand side of eq. (4) is the estimation inaccuracy due to the non-stationarity; while the second term is the estimation error due to random noise. We now upper bound the two terms separately. We upper bound the first term in the sense.
For any we have
Proof Sketch. Our analysis relies on bounding the maximum eigenvalue of
for each . Please refer to the appendix for the complete proof. Q.E.D.
By applying (Abbasi-Yadkori et al. 2011), we upper bound the second term in the matrix norm sense.
Lemma 2 ((Abbasi-Yadkori et al. 2011))
For any and any we have
holds with probability at least
From now on, we shall denote
for the ease of presentation. With these two lemmas, we have the following deviation inequality type bound for the latent expected reward of any action in any round
For any and any , we have with probability at least
holds for all
In this section, we describe the Sliding Window Upper Confidence Bound (SW-UCB) algorithm. When the variation budget is known, we show that SW-UCB algorithm with a tuned window size achieves a dynamic regret bound which is optimal up to a multiplicative logarithmic factor. When the variation budget is unknown, we show that SW-UCB algorithm can still be implemented with a suitably chosen window size so that the regret dependency on is optimal, which still results in first order optimality in this case (Keskin and Zeevi 2016).
In the stochastic environment where the reward function is stationary, the well known UCB algorithm follows the principle of optimism in face of uncertainty (Auer et al. 2002b, Abbasi-Yadkori et al. 2011). Under this principle, the DM selects the action that maximizes the UCB, or the value of “mean plus confidence radius” (Auer et al. 2002b). We follow the principle by choosing in each round the action with the highest UCB, i.e.,
When the number of actions is moderate, the optimization problem (6) can be solved by an enumeration over all Upon selecting we have
by virtue of UCB. From Theorem 2, we further have with probability at least
The regret upper bound of the SW-UCB algorithm (to be formalized in Theorem 3) is thus
If is known, the DM can set and achieve a regret upper bound If is not known, which is often the case in practice, the DM can set to obtain a regret upper bound
In this section, we describe the details of the SW-UCB algorithm. Following its design guideline, the SW-UCB algorithm selects a positive regularization parameter and initializes In each round the SW-UCB algorithm first computes the estimate for according to eq. 3, and then finds the action with largest UCB by solving the optimization problem (6). Afterwards, the corresponding reward is observed. The pseudo-code of the SW-UCB algorithm is shown in Algorithm 1.
We are now ready to formally state a regret upper bound of the SW-UCB algorithm.
The dynamic regret of the SW-UCB algorithm is upper bounded as
When is known, by taking the dynamic regret of the SW-UCB algorithm is
When is unknown, by taking the dynamic regret of the SW-UCB algorithm is
Proof Sketch. The proof utilizes the fact that the per round regret of the SW-UCB algorithm is upper bounded by the UCB of the chosen action, and decomposes the UCB into two separate terms according to Lemmas 1 and 2, i.e., regret in round is equal to
The first term can be upper bounded by an intuitive telescoping sum; while for the second term, although a similar quantity is analyzed by the authors of (Abbasi-Yadkori et al. 2011) using a (beautiful) matrix telescoping technique under the stationary environment, we note that due to the “forgetting principle” of the SW-UCB algorithm, we cannot directly adopt the technique. Our proof thus makes a novel use of the Sherman-Morrison formula to overcome the barrier. Please refer to the appendix for the complete proof. Q.E.D.
When the variation budget is known, Theorem 3 recommends choosing the length of the sliding window to be decreasing in the variation budget. The recommendation is in agreement with the intuition that, when the learning environment becomes more volatile, the DM should focus on more recent observations. Indeed, if the underlying learning environment is changing at a higher rate, then the DM’s past observations become obsolete faster. Theorem 3 pins down the intuition of forgetting past observations in the face of drifting environments, by providing the mathematical definition of the sliding window length that yields the optimal regret bound.
In the preceding section, we have seen that, by properly tuning the window length, the DM can achieve a first order optimal regret bound even if the knowledge of the variation budget is not available. However, in the case of an unknown and large variation budget, i.e., the bound becomes meaningless as it is linear in To handle this case, we wish to design an online algorithm that incurs a dynamic regret of order for some and , without knowing the variation budget. Note from Theorem 1, no algorithm can achieve a dynamic regret of order , so we must have . In this section, we develop a novel Bandit-over-Bandits (BOB) algorithm that achieves a regret of . Hence, BOB still has a dynamic regret sublinear in when for any and is not known, unlike the SW-UCB algorithm.
Reviewing Theorem 3, we know that setting the window length to a fixed value
can give us a regret bound. But when is not provided a priori, we need to also “learn” the unknown in order to properly tune In a more restrictive setting in which the differences between consecutive ’s follow some underlying stochastic process, one possible approach is applying a suitable machine learning technique to learn the underlying stochastic process at the beginning, and tune the parameter accordingly. In the more general setting, however, this strategy cannot work as the change between consecutive ’s can be arbitrary (or even adversarial) as long as the total variation is bounded by
The above mentioned observations as well as the established results motivate us to make use of the SW-UCB algorithm as a sub-routine, and “hedge” (Auer et al. 2002a, Audibert and Bubeck 2009) against the (possibly adversarial) changes of ’s to identify a reasonable fixed window length.
To this end, we describe the main idea of the Bandit-over-Bandits (BOB) algorithm. As illustrated in Fig. 1, the BOB algorithm divides the whole time horizon into blocks of equal length rounds (the last block can possibly have fewer than rounds), and specifies a set from which each is drawn. For each block , the BOB algorithm first selects a window length , and restarts the SW-UCB algorithm with the selected window length as a sub-routine to choose actions for this block. On top of this, the BOB algorithm also maintains a separate algorithm for adversarial multi-armed bandits, e.g., the EXP3 algorithm for adversarial multi-armed bandits against an oblivious adversary (Auer et al. 2002a, Audibert and Bubeck 2009), to govern the selection of the window length for each block, hence the name Bandit-over-Bandits. Here, the total reward of each block is used as feedback for the EXP3 algorithm. It is worth emphasizing that
For validity of the EXP3 algorithm, the SW-UCB algorithm for each block does not use any data collected in previous blocks except for the choice of window length (please see Remark 2 for a more involved discussion).
Due to the design of restarting, any instance of the SW-UCB algorithm cannot last for more than rounds. As a consequence, even if the EXP3 selects a window length for some block the effective window length is It is thus reasonable to enforce that is a subset of i.e.,
To determine and , we first consider the regret of the BOB algorithm. As mentioned above, since is not necessarily attainable, i.e., by definition in eq. (12), might be larger than when is small, we hence denote the optimally (over ) tuned window length as By design of the BOB algorithm, its regret can be decomposed as the regret of the SW-UCB algorithm with the optimally tuned window length for each block plus the loss due to learning the value with the EXP3 algorithm, i.e.,
Here, the decomposition above holds as the BOB algorithm restarts the SW-UCB algorithm in each block, and for a round in block refers to the action selected in round by the SW-UCB algorithm with window length initiated at the beginning of block
is the total variation in block
We then turn to the second expectation in the decomposition above. We can easily see that the number of rounds for the EXP3 algorithm is and the number of possible values of ’s is Denoting the maximum absolute sum of rewards of any block as random variable the authors of (Auer et al. 2002a) give the following regret bound.
To proceed, we have to give a high probability upper bound for
With probability at least does not exceed i.e.,
Proof Sketch. The proof makes use of the -sub-Gaussian property of the noise terms as well as the union bound. Please refer to the appendix for the complete proof. Q.E.D.
Note that the regret of our problem is at most eq. (15) can be further upper bounded as
On one hand, should be small to control the regret incurred by the EXP3 algorithm in identifying i.e., the third term in eq. (17).
On the other hand, should also be large enough to allow to get close to so that the sum of the first two terms in eq. (17) is minimized.
A more careful inspection also reveals the tension in the design of Obviously, we hope that is small to minimize the third term in eq. (17), but we also wish to be dense enough so that it forms a cover to the set Otherwise, even if is large enough that can approach approximating with any element in can cause a major loss.
These observations suggest the following choice of
for some positive integer and since the choice of should not depend on we can set with some and to be determined. We then distinguish two cases depending on whether is smaller than or not (or alternatively, whether is larger than or not).
Under this situation, can automatically adapt to the nearly optimal window length , where finds the largest element in that does not exceed Notice that the regret of the BOB algorithm then becomes
Under this situation, equals to which is the window length closest to the regret of the BOB algorithm then becomes
where we have made use of the fact that
Now both eq. (19) and eq. (20) suggest that we should set and eq. (20) further reveals that we should take and Plugging these choices of parameters back to case 1 and eq. (19), we have when the dynamic regret of the BOB algorithm is upper bounded as
while if (or case 2), the dynamic regret of the BOB algorithm is upper bounded as
according to eq. (20). Here we have to emphasize that the choice of and are purely for the purpose of analysis, while the only parameters that we need to decide are
which clearly do not depend on
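The two design choices discussed above, a block structure over the time horizon and a small but sufficiently dense set of candidate window lengths, can be sketched as follows. This is only an illustration under stated assumptions: the geometric spacing of the grid, the function names, and the example values of the block length and the number of grid levels are all hypothetical and are not the tuned choices derived in the analysis.

```python
import math

def window_grid(block_length, num_levels):
    """Geometrically spaced candidate window lengths in [1, block_length].

    Assumption: a geometric grid is used so that few candidates cover
    every scale between 1 and the block length.
    """
    grid = {1, block_length}
    for j in range(1, num_levels):
        w = math.floor(block_length ** (j / num_levels))
        grid.add(min(max(w, 1), block_length))
    return sorted(grid)

def blocks(horizon, block_length):
    """Split rounds 1..horizon into consecutive blocks (last may be shorter)."""
    return [(start, min(start + block_length - 1, horizon))
            for start in range(1, horizon + 1, block_length)]

print(window_grid(block_length=100, num_levels=4))
print(blocks(horizon=10, block_length=4))
```

A grid of this kind has only logarithmically many elements relative to the block length, which keeps the number of arms seen by the outer adversarial bandit small while still placing some candidate within a constant factor of any target window length.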
We are now ready to describe the details of the BOB algorithm. With and defined as eq. (23), the BOB algorithm additionally initiates the parameter
for the EXP3 algorithm (Auer et al. 2002a). The BOB algorithm then divides the time horizon into blocks of length rounds (except for the last block, which can be less than rounds). At the beginning of each block the BOB algorithm first sets
and then sets The selected window length is thus Afterwards, the BOB algorithm selects actions by running the SW-UCB algorithm with window length for each round in block and the total collected reward is
Finally, the rewards are rescaled by dividing and then added by so that it lies within with high probability, and the parameter is set to
while is the same as for all The pseudo-code of the BOB algorithm is shown in Algorithm 2.
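The outer layer of the procedure described above can be sketched with the standard EXP3 update of (Auer et al. 2002a), where each arm corresponds to one candidate window length and the feedback is the rescaled total reward of a block. This is a minimal sketch, not the paper's Algorithm 2: the class and function names, the `bound` argument used for rescaling, and the clipping to the unit interval are assumptions introduced for illustration.

```python
import math
import random

class Exp3:
    """Standard EXP3 over a finite set of arms with rewards in [0, 1]."""

    def __init__(self, num_arms, gamma):
        self.k = num_arms
        self.gamma = gamma
        self.weights = [1.0] * num_arms

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def draw(self, rng=random):
        """Sample an arm index (here: an index into the window length set)."""
        return rng.choices(range(self.k), weights=self.probabilities())[0]

    def update(self, arm, reward01):
        """Importance-weighted exponential update for the pulled arm only."""
        p = self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * reward01 / (p * self.k))

def rescale(block_reward, bound):
    """Map a block reward in [-bound, bound] into [0, 1].

    Assumption: `bound` is a high-probability bound on the absolute block
    reward; values outside the range are clipped in this sketch.
    """
    return min(1.0, max(0.0, block_reward / (2 * bound) + 0.5))

# Hypothetical usage: arm 0 keeps producing high block rewards.
learner = Exp3(num_arms=3, gamma=0.1)
for _ in range(20):
    learner.update(arm=0, reward01=rescale(8.0, bound=10.0))
print(learner.probabilities())
```

In a full implementation, each iteration of the loop would first call `draw()` to pick a window length, restart the inner SW-UCB routine for one block with that window length, and then feed the block's rescaled total reward to `update()`.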
We are now ready to present the regret analysis of the BOB algorithm.
The dynamic regret of the BOB algorithm is upper bounded as follows.
Proof Sketch. The proof of the theorem essentially follows the derivation above; please refer to the appendix for the complete proof. Q.E.D.
The block structure and restarting the SW-UCB algorithm with a single window length for each block are essential for the correctness of the BOB algorithm. Otherwise, suppose the DM utilizes the EXP3 algorithm to select the window length for each round and implements the SW-UCB algorithm with the selected window length without ever restarting it. Instead of the block-wise decomposition used in the analysis above, the regret of the BOB algorithm is then decomposed as
Here, with some abuse of notation, refers to in round the DM runs the SW-UCB algorithm with window length and historical data, e.g., (action, reward) pairs, generated by running the SW-UCB algorithm with window length for rounds Same as before, the second term of eq. (2) can be upper bounded as a result of Theorem 3. It is also tempting to apply results from the EXP3 algorithm to upper bound the first term. Unfortunately, this is incorrect as it is required by the adversarial bandits protocol (Auer et al. 2002a) that the DM and its competitor should receive the same reward if they select the same action, i.e., the reward of in round and the reward of in round should be the same for every Nevertheless, this is violated as running the SW-UCB algorithm with different window lengths for previous rounds can generate different (action, reward) pairs, and this results in possibly different estimated ’s for the two SW-UCB algorithms even if both of them use the same window length in round Hence, the selected actions and the corresponding rewards by these two instances might also be different. By the careful design of blocks as well as the restarting scheme, the BOB algorithm decouples the SW-UCB algorithm for a block from previous blocks, and thus fixes the above mentioned problem, i.e., the regret of the BOB algorithm is decomposed block by block as in the analysis above.
The assumptions that the decision sets ’s and the underlying parameters ’s are chosen by an oblivious adversary as well as the i.i.d. noise terms are important. With these, the EXP3 algorithm is also facing an obliviously adversarial environment, and thus satisfies the adversarial bandits protocol (Auer et al. 2002a, Audibert and Bubeck 2009). To see this, assuming additional access to all the i.i.d. (and thus independent of the DM’s actions) noise terms in advance, the adversary can determine the total rewards of each block with respect to each window length in , independently of the EXP3 algorithm, by running the SW-UCB algorithm with that window length as well as the obliviously chosen ’s and ’s.
The structure of the BOB algorithm can be roughly seen as using an EXP3 algorithm to govern the behaviors of many copies of SW-UCB algorithms with different window lengths. As mentioned in the discussion of related works, it has a flavor similar to the technique of bandits corralling/aggregation, see e.g., (Agarwal et al. 2017, Besbes et al. 2018) and references therein. Existing works, such as (Luo et al. 2018, Besbes et al. 2018), have tried to apply bandits corralling/aggregation to the non-stationary bandits settings, but when the variation budget is unknown, they can only obtain either sub-optimal dynamic regret bounds (Luo et al. 2018) or empirical performance guarantees (Besbes et al. 2018).
In this section, we demonstrate the generality of our established results. As illustrative examples, we apply the sliding window UCB algorithm as well as the bandit-over-bandit framework to the prevalent multi-armed bandits (Auer et al. 2002b) and generalized linear bandit settings. In what follows, we shall derive the SW-UCB algorithm as well as the parameters required by the BOB algorithm, i.e., similar to those defined in eq. (23), for both cases. We note that the same steps can be applied to other settings, such as Gaussian process bandits (Srinivas et al. 2010) and combinatorial bandits with semi-bandit feedback (Kveton et al. 2015), to derive similar results.
In the -armed bandits setting, every action set is comprised of actions The action has coordinate equals to 1 and all other coordinates equal to Therefore, the reward of choosing action in round is
where is the coordinate of For a window length we also define as the number of times that action is chosen within rounds.
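In the multi-armed setting, the sliding window statistics reduce to per-arm counts and reward averages over the most recent rounds. The following sketch maintains them incrementally; the class name and the toy sequence are hypothetical, and the confidence radius that would be added to each average is intentionally omitted because its exact form is not reproduced here.

```python
from collections import deque

class SlidingWindowArms:
    """Per-arm pull counts and reward averages over the last `window` rounds."""

    def __init__(self, num_arms, window):
        self.window = window
        self.history = deque()
        self.counts = [0] * num_arms
        self.sums = [0.0] * num_arms

    def record(self, arm, reward):
        """Add the newest observation and forget the one leaving the window."""
        self.history.append((arm, reward))
        self.counts[arm] += 1
        self.sums[arm] += reward
        if len(self.history) > self.window:
            old_arm, old_reward = self.history.popleft()
            self.counts[old_arm] -= 1
            self.sums[old_arm] -= old_reward

    def mean(self, arm):
        """Windowed average reward of an arm (0.0 if not pulled in the window)."""
        n = self.counts[arm]
        return self.sums[arm] / n if n else 0.0

stats = SlidingWindowArms(num_arms=2, window=3)
for arm, reward in [(0, 1.0), (1, 0.0), (0, 1.0), (1, 1.0)]:
    stats.record(arm, reward)
print(stats.counts, stats.mean(0), stats.mean(1))
```

Each update costs constant time regardless of the window length, since only the entering and the leaving observation are touched.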