Cheung, Simchi-Levi, and Zhu
Drifting Reinforcement Learning: The Blessing of (More) Optimism in Face of Endogenous & Exogenous Dynamics
Wang Chi Cheung
\AFFDepartment of Industrial Systems Engineering and Management, National University of Singapore
\EMAILisecwc@nus.edu.sg
\AUTHORDavid Simchi-Levi
\AFFInstitute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139
\EMAILdslevi@mit.edu
\AUTHORRuihao Zhu
\AFFInstitute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139
\EMAILrzhu@mit.edu
We consider undiscounted reinforcement learning (RL) in Markov decision processes (MDPs) under temporal drifts, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. This setting captures the endogenous and exogenous dynamics, uncertainty, and partial feedback in sequential decision-making scenarios, and finds applications in various online marketplaces, such as vehicle remarketing in used-car sales, real-time bidding in advertisement auctions, and dynamic pricing in ride-sharing platforms. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Finally, we conduct numerical experiments to show that our proposed algorithms achieve superior empirical performance compared to existing algorithms.
Notably, the interplay between endogenous and exogenous dynamics presents a unique challenge, absent in existing (stationary and nonstationary) stochastic online learning settings, when we apply the conventional Optimism in Face of Uncertainty (OFU) principle to design algorithms with provably low dynamic regret for RL in drifting MDPs. We overcome this challenge with a novel confidence widening technique that incorporates additional optimism into our learning algorithms to ensure low dynamic regret bounds. To extend our theoretical findings, we apply our framework to inventory control problems, and demonstrate how one can alternatively leverage special structures of the state transition distributions to bypass the difficulty of exploring time-varying environments.
drifting reinforcement learning, revenue management, online marketplaces, data-driven decision-making, confidence widening
1 Introduction
Consider a sequential decision-making framework, where a decision-maker (DM) interacts with a discrete-time Markov decision process (MDP) iteratively. At each time step, the DM first observes the current state of the MDP, and then chooses an available action. After that, she receives an instantaneous random reward, and the MDP transits to the next state. The DM aims to design a policy that maximizes her cumulative rewards, while facing the following challenges:

Endogenous dynamics: At each time step, the reward follows a reward distribution, and the subsequent state follows a state transition distribution. Both distributions depend on the current state and action, which are influenced by the policy.

Exogenous dynamics: The reward and state transition distributions vary (independently of the policy) across time steps, but the total variations are bounded by the respective variation budgets.

Uncertainty: Both the reward and state transition distributions are initially unknown to the DM.

Bandit/Partial feedback: In each time step, the DM can only observe the reward and state transition resulting from the current state and action.
It turns out that many applications, such as vehicle remarketing in used-car sales and real-time bidding in advertisement (ad) auctions, can be captured by this framework.

Vehicle remarketing in used-car sales: An automobile company disposes of continually arriving off-lease vehicles (i.e., leased vehicles that have reached the end of their fixed term) via daily wholesale vehicle auctions (Manheim 2020, Vehicle Remarketing 2020). At the beginning of each auction, the company decides the number of off-lease vehicles to be listed, and then the car dealers bid for the purchases via a first-price auction. The sales of vehicles generate revenue for the company, while unsold vehicles incur holding costs. The company aims at maximizing profit by designing a policy that dynamically decides the vehicles to be listed in each auction. However, the dealers’ bidding behaviors are affected by many unpredictable (and thus exogenous) factors (e.g., real-time customer demands, vehicles’ depreciation, and inter-dealer competition) in addition to the company’s decisions (i.e., the vehicles listed), and can vary across time.

Real-time bidding in ad auctions: Advertisers repeatedly compete for ad display impressions via real-time online auctions (Google 2011). Each advertiser begins with a (periodically refilled) budget. Upon the arrival of a user, an impression is generated, and the advertisers submit bids for it. Then, the winning advertiser makes the payment (determined by the auction mechanism) using her remaining budget, and displays her ad to the user. Finally, she observes the user feedback (Cai et al. 2017, Flajolet and Jaillet 2017, Balseiro and Gur 2019, Choi and Sayedi 2019, Guo et al. 2019). Each advertiser wants to maximize the number of clicks on her advertisement. Nevertheless, the competitiveness of each auction exhibits exogeneity, as the participating advertisers and the arriving users differ from time to time. Moreover, the popularity of an ad can change for endogenous reasons. For instance, displaying the same ad too frequently in a short period of time might reduce its freshness, and result in a temporarily low number of clicks (i.e., we can incorporate both the remaining budget and the number of times that the ad has been shown within a given window into the state of the MDP to model endogenous dynamics).
Besides, this framework can be used to model sequential decision-making problems in ride-sharing (Banerjee et al. 2015, Gurvich et al. 2018, Taylor 2018, Bimpikis et al. 2019, Kanoria and Qian 2019), transportation (Zhang and Wang 2018, Qin et al. 2019), healthcare operations (Shortreed et al. 2010, Yu et al. 2019), and inventory control (Besbes and Muharremoglu 2013, Bertsekas 2017, Zhang et al. 2018, Agrawal and Jia 2019, Chen et al. 2019a).
There exist numerous works in sequential decision-making that consider a subset of the four challenges. The traditional stream of research (Auer et al. 2002b, Bubeck and Cesa-Bianchi 2012, Lattimore and Szepesvári 2018) on stochastic multi-armed bandits (MAB) focuses on the interplay between uncertainty and bandit feedback (i.e., challenges 3 and 4), and (Auer et al. 2002b) propose the classical Upper Confidence Bound (UCB) algorithm. Starting from (Burnetas and Katehakis 1997, Tewari and Bartlett 2008, Jaksch et al. 2010), a volume of works (see Section 3) has been devoted to reinforcement learning (RL) (Sutton and Barto 2018) in stationary MDPs, which further involves endogenous dynamics. Stationary MDPs incorporate challenges 1, 3, and 4, and stochastic MAB is a special case of online MDPs when there is only one state. In the absence of exogenous dynamics, the reward and state transition distributions are invariant across time, and these three challenges can be jointly solved by the Upper Confidence bound for Reinforcement Learning (UCRL2) algorithm (Jaksch et al. 2010).
The UCB and UCRL2 algorithms leverage the optimism in face of uncertainty (OFU) principle to select actions iteratively based on the entire collection of historical data. However, both algorithms quickly deteriorate when exogenous dynamics emerge, since the historical data become obsolete. To address the challenge of exogenous dynamics, a recent line of research initiated by (Besbes et al. 2014) studies drifting bandit environments, in which the reward distributions can change arbitrarily and independently of the actions chosen over time, but the total change (quantified by a suitable metric) is upper bounded by a variation budget (Besbes et al. 2014). The aim is to minimize the dynamic regret, the optimality gap compared to the cumulative rewards of the sequence of optimal actions. The drifting bandit setting addresses the challenges of uncertainty, partial feedback, and exogenous dynamics (i.e., challenges 2, 3, and 4), but endogenous dynamics (challenge 1) are not present. In (Jaksch et al. 2010), the authors also consider RL in piecewise-stationary MDPs. Nevertheless, we show in Section 6 that simply adapting the techniques for drifting bandits or switching MDPs to the setting of drifting MDPs can result in poor dynamic regret bounds.
In this paper, we consider RL in drifting MDPs, i.e., sequential decision-making under all four of the above-mentioned challenges. We assume that, during the $T$ time steps, the total variations of the reward and state transition distributions are bounded (under suitable metrics) by the variation budgets $B_r$ and $B_p$, respectively. We design and analyze novel algorithms for RL in drifting MDPs. Let $D_{\max}$, $S$, and $A$ be the maximum diameter (a complexity measure to be defined in Section 2), the number of states, and the number of actions in the MDP. Our main contributions are:

We develop the Sliding Window UCRL2 with Confidence Widening (SWUCRL2CW) algorithm. When the variation budgets are known, we prove via a budget-aware analysis that it attains a dynamic regret bound of $\widetilde{O}\big(D_{\max}(B_r+B_p)^{1/4}S^{2/3}A^{1/2}T^{3/4}\big)$.

We propose the Bandit-over-Reinforcement Learning (BORL) algorithm that tunes the SWUCRL2CW algorithm adaptively, and retains the same dynamic regret bound without knowing the variation budgets.

We identify an unprecedented challenge for RL in drifting MDPs with optimistic exploration: existing algorithmic frameworks for nonstationary online learning (Jaksch et al. 2010, Cheung et al. 2019b) typically estimate unknown parameters by averaging historical data in a “forgetting” fashion, and construct the tightest possible confidence regions accordingly. They then optimistically search for the most favorable model within the confidence regions, and execute the corresponding optimal policy. If the DM follows this guideline, she would repeatedly execute an updated optimal policy on an “optimistic” MDP when learning drifting MDPs. However, the diameters of the optimistic MDPs constructed in this manner can grow wildly, and may result in an unfavorable dynamic regret bound. We overcome this with our novel proposal of extra optimism via the confidence widening technique.

As a complement to this finding, suppose that for any pair of initial state and target state, there always exists an action such that the probability of transitioning from the initial state to the target state by taking this action is lower bounded throughout the entire time horizon; then the DM can attain low dynamic regret without widening the confidence regions. We demonstrate that, in the context of inventory control, a mild condition on the demand distribution suffices for this extra assumption to hold.
The rest of the paper is organized as follows: in Section 2, we describe the nonstationary MDP model of interest. In Section 3, we review related works in nonstationary online learning and reinforcement learning. In Section 4, we introduce the SWUCRL2CW algorithm, and analyze its performance in terms of dynamic regret. In Section 5, we design the BORL algorithm, which attains the same dynamic regret bound as the SWUCRL2CW algorithm without knowing the total variations. In Section 6, we discuss the challenges in designing learning algorithms for reinforcement learning under drift, and show how the novel confidence widening technique mitigates them. In Section 7, we discuss the alternative approach that does not widen the confidence regions. In Section 8, we conduct numerical experiments to show the superior empirical performance of our algorithms. In Section 9, we conclude the paper.
2 Problem Formulation
An instance of nonstationary online MDP is specified by the tuple $(\mathcal{S}, \mathcal{A}, T, r, p)$. The set $\mathcal{S}$ is a finite set of states. The collection $\mathcal{A} = \{\mathcal{A}_s\}_{s\in\mathcal{S}}$ contains a finite action set $\mathcal{A}_s$ for each state $s\in\mathcal{S}$. We say that $(s,a)$ is a state-action pair if $s\in\mathcal{S}$ and $a\in\mathcal{A}_s$. We denote $S = |\mathcal{S}|$ and $A = \max_{s\in\mathcal{S}}|\mathcal{A}_s|$. We denote $T$ as the total number of time steps, and denote $r = \{r_t\}_{t=1}^{T}$ as the sequence of mean rewards. For each $t$, we have $r_t = \{r_t(s,a)\}_{(s,a)}$, and $r_t(s,a)\in[0,1]$ for each state-action pair $(s,a)$. In addition, we denote $p = \{p_t\}_{t=1}^{T}$ as the sequence of transition kernels. For each $t$, we have $p_t = \{p_t(\cdot\mid s,a)\}_{(s,a)}$, where $p_t(\cdot\mid s,a)$ is a probability distribution over $\mathcal{S}$ for each state-action pair $(s,a)$.
The quantities $r_t$ and $p_t$ vary across different $t$'s in general. Following (Besbes et al. 2014), we quantify the variations on $r$ and $p$ in terms of their respective variation budgets $B_r, B_p > 0$:
$$B_r = \sum_{t=1}^{T-1}\max_{(s,a)}\big|r_{t+1}(s,a)-r_t(s,a)\big|, \qquad B_p = \sum_{t=1}^{T-1}\max_{(s,a)}\big\|p_{t+1}(\cdot\mid s,a)-p_t(\cdot\mid s,a)\big\|_{1}. \qquad (1)$$
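For concreteness, given full hindsight access to the sequences of mean rewards and transition kernels, the two budgets in (1) can be computed as follows. This is our own illustrative sketch; the array shapes are an assumed convention, not notation from the paper:

```python
import numpy as np

def variation_budgets(r, p):
    """Compute the variation budgets B_r and B_p of eqn. (1).

    r: array of shape (T, S, A) -- mean rewards r_t(s, a)
    p: array of shape (T, S, A, S) -- transition kernels p_t(.|s, a)
    """
    # B_r: sum over t of the largest per-(s, a) change in mean reward
    B_r = sum(np.max(np.abs(r[t + 1] - r[t])) for t in range(len(r) - 1))
    # B_p: sum over t of the largest per-(s, a) L1 change in the kernel
    B_p = sum(np.max(np.abs(p[t + 1] - p[t]).sum(axis=-1))
              for t in range(len(p) - 1))
    return B_r, B_p
```

Note that both budgets are defined with a maximum over state-action pairs inside the sum over time, so a single pair with large drift at one time step contributes its full change to the budget.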
We emphasize that although $B_r$ and $B_p$ might be used as inputs by the DM, the individual per-step variations $B_{r,t} := \max_{(s,a)}|r_{t+1}(s,a)-r_t(s,a)|$ and $B_{p,t} := \max_{(s,a)}\|p_{t+1}(\cdot\mid s,a)-p_t(\cdot\mid s,a)\|_1$ are unknown to the DM throughout the current paper.
Dynamics. The DM faces an online nonstationary MDP instance $(\mathcal{S}, \mathcal{A}, T, r, p)$. She knows $\mathcal{S}$, $\mathcal{A}$, and $T$, but not $r$ or $p$. The DM starts at an arbitrary state $s_1 \in \mathcal{S}$. At time $t$, three events happen. First, the DM observes its current state $s_t$. Second, she takes an action $a_t \in \mathcal{A}_{s_t}$. Third, given $s_t, a_t$, she stochastically transits to another state $s_{t+1}$, which is distributed as $p_t(\cdot\mid s_t, a_t)$, and receives a stochastic reward $R_t(s_t, a_t)$, which is 1-sub-Gaussian with mean $r_t(s_t, a_t)$. In the second event, the choice of $a_t$ is based on a non-anticipatory policy $\Pi$. That is, the choice only depends on the current state $s_t$ and the previous observations $(s_1, a_1, R_1(s_1, a_1), \dots, s_{t-1}, a_{t-1}, R_{t-1}(s_{t-1}, a_{t-1}))$.
Dynamic Regret. The DM aims to maximize the cumulative expected reward $\mathbb{E}\big[\sum_{t=1}^{T} R_t(s_t, a_t)\big]$, despite the model uncertainty on $r$ and $p$ and the nonstationarity of the learning environment. To measure the convergence to optimality, we consider an equivalent objective of minimizing the dynamic regret (Besbes et al. 2014, Jaksch et al. 2010)
$$\text{Dyn-Reg}_T(\Pi) := \sum_{t=1}^{T}\rho^*_t \;-\; \mathbb{E}\Big[\sum_{t=1}^{T}R_t(s_t,a_t)\Big]. \qquad (2)$$
In the offline benchmark $\sum_{t=1}^{T}\rho^*_t$, the summand $\rho^*_t$ is the optimal long-term average reward of the stationary MDP with transition kernel $p_t$ and mean reward $r_t$. The optimum $\rho^*_t$ can be computed by solving the linear program (13) provided in Section A.1. {remark} When $S = 1$, (2) reduces to the definition of dynamic regret in (Besbes et al. 2014) for nonstationary multi-armed bandits. Nevertheless, different from the bandit case, the offline benchmark does not equal the expected optimum for the nonstationary MDP problem in general.
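As an aside, $\rho^*_t$ can also be computed without a linear program, via relative value iteration on the stationary MDP $(r_t, p_t)$. The sketch below is our own illustration and assumes a communicating instance on which relative value iteration converges (e.g., aperiodic); array shapes are an assumed convention:

```python
import numpy as np

def optimal_gain(r, p, iters=100000, tol=1e-10):
    """Optimal long-term average reward of the stationary MDP (r, p),
    by relative value iteration with state 0 as the reference state.

    r: shape (S, A) -- mean rewards; p: shape (S, A, S) -- kernel.
    """
    S = r.shape[0]
    h = np.zeros(S)   # relative bias values, normalized so h[0] = 0
    gain = 0.0
    for _ in range(iters):
        q = r + p @ h            # one-step lookahead values, shape (S, A)
        t_h = q.max(axis=1)      # Bellman update
        gain = t_h[0] - h[0]     # gain estimate at the reference state
        t_h = t_h - t_h[0]       # renormalize to keep the iterates bounded
        if np.max(np.abs(t_h - h)) < tol:
            break
        h = t_h
    return gain
```

On periodic instances the iterates can oscillate, which is one reason the paper works with the linear-programming formulation instead.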
Next, we review relevant concepts on MDPs, in order to stipulate an assumption that ensures learnability and justifies our offline benchmark. {definition}[Communicating MDPs and Diameters (Jaksch et al. 2010)] Consider a set of states $\mathcal{S}$, a collection $\mathcal{A} = \{\mathcal{A}_s\}_{s\in\mathcal{S}}$ of action sets, and a transition kernel $\bar p$. For any $s, s' \in \mathcal{S}$ and stationary policy $\pi$, the hitting time from $s$ to $s'$ under $\pi$ is the random variable $\Lambda(s' \mid \pi, s) := \min\{t : s_{t+1} = s', \, s_1 = s, \, s_{q+1} \sim \bar p(\cdot\mid s_q, \pi(s_q)) \ \forall q\},$ which can be infinite. We say that $(\mathcal{S}, \mathcal{A}, \bar p)$ is a communicating MDP iff
$$D := \max_{s, s' \in \mathcal{S}} \min_{\pi} \mathbb{E}\big[\Lambda(s' \mid \pi, s)\big]$$
is finite. The quantity $D$ is the diameter associated with $(\mathcal{S}, \mathcal{A}, \bar p)$. Throughout the paper, we make the following assumption. {assumption} For each $t \in \{1, \dots, T\}$, the tuple $(\mathcal{S}, \mathcal{A}, p_t)$ constitutes a communicating MDP with diameter at most $D_t$. We denote the maximum diameter as $D_{\max} = \max_{t \in \{1, \dots, T\}} D_t$.
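To make the definition concrete, the diameter of a small communicating MDP can be computed by value iteration on the stochastic shortest-path problem toward each target state. This is our own illustrative sketch, with `p[s, a, s2]` standing for $\bar p(s_2 \mid s, a)$:

```python
import numpy as np

def diameter(p, iters=10000, tol=1e-9):
    """Diameter of the MDP (S, A, p): the maximum over ordered state
    pairs (s, s') of the minimal expected time to reach s' from s.
    Computed by value iteration, treating the target as absorbing.

    p: array of shape (S, A, S) with p[s, a, s2] = p(s2 | s, a).
    """
    S = p.shape[0]
    D = 0.0
    for target in range(S):
        h = np.zeros(S)  # expected hitting times to `target`
        for _ in range(iters):
            h_new = 1.0 + (p @ h).min(axis=1)  # pay one step, act optimally
            h_new[target] = 0.0                # already at the target
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        D = max(D, h.max())
    return D
```

For a communicating MDP every hitting-time iteration converges to a finite limit; a non-communicating instance would show diverging entries of `h`.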
The following proposition justifies our choice of the offline benchmark $\sum_{t=1}^{T}\rho^*_t$. {proposition} Consider an instance $(\mathcal{S}, \mathcal{A}, T, r, p)$ that satisfies Assumption 2 with maximum diameter $D_{\max}$, and has variation budgets $B_r$ and $B_p$ for the rewards and transition kernels, respectively. It holds that
$$\sum_{t=1}^{T}\rho^*_t - \max_{\Pi}\mathbb{E}\Big[\sum_{t=1}^{T}R_t(s_t,a_t)\Big] \le \mathrm{Err}(B_r, B_p, D_{\max}, T),$$
where the error term $\mathrm{Err}(B_r, B_p, D_{\max}, T)$ is dominated by our subsequent dynamic regret bounds.
The maximum is taken over all non-anticipatory policies $\Pi$. We denote $\{s_t\}_{t=1}^{T+1}$ as the trajectory under policy $\Pi$, where $a_t$ is determined based on $\Pi$ and $(s_1, \dots, s_t)$, and $s_{t+1} \sim p_t(\cdot\mid s_t, a_t)$ for each $t$. The Proposition is proved in Appendix A.2. In fact, our dynamic regret bounds are larger than the error term, thus justifying the choice of $\sum_{t=1}^{T}\rho^*_t$ as the offline benchmark. The offline benchmark is also more convenient for analysis than the expected optimum, since the former can be decomposed into summations across different intervals, unlike the latter, where the summands are intertwined (since $s_{t+1}$ depends on $s_t$ and $a_t$).
3 Related Works
Learning stationary undiscounted MDPs has been studied in (Burnetas and Katehakis 1997, Bartlett and Tewari 2009, Jaksch et al. 2010, Agrawal and Jia 2017, Fruit et al. 2018a, b). For learning nonstationary MDPs, the stream of works (Dick et al. 2014, Cardoso et al. 2019) considered changing reward distributions but fixed transition kernels. (Yu and Mannor 2009, Yu et al. 2009) allowed arbitrary changes in rewards but bounded changes in the transition kernels, and designed algorithms under Markov chain mixing assumptions. (Jaksch et al. 2010, Gajane et al. 2018) proposed solutions for the piecewise-stationary setting. (EvenDar et al. 2005, AbbasiYadkori et al. 2013, Li et al. 2019) considered learning MDPs with full information feedback in various adversarial and nonstationary environments. Episodic MDPs with adversarially changing rewards and stationary transition kernels are studied under full information feedback (Neu et al. 2012, Rosenberg and Mansour 2019a) and bandit feedback (Neu et al. 2010b, a, Arora et al. 2012, Rosenberg and Mansour 2019b, Jin et al. 2019). In (Nilim and Ghaoui 2005, Xu and Mannor 2006), robust control of nonstationary MDPs was studied.
In a parallel work, (Ortner et al. 2019) considered a similar setting to ours by applying the “forgetting principle” from nonstationary bandit settings (Garivier and Moulines 2011b, Cheung et al. 2019a) to design a learning algorithm. To achieve its dynamic regret bound, the algorithm by (Ortner et al. 2019) partitions the entire time horizon into time intervals, and crucially requires access to the variations in both reward and state transition distributions of each interval (see Theorem 3 in (Ortner et al. 2019)). In contrast, the SWUCRL2CW algorithm and the BORL algorithm require significantly less information on the variations. Specifically, the SWUCRL2CW algorithm does not need any additional knowledge on the variations except for $B_r$ and $B_p$, i.e., the variation budgets over the entire time horizon as defined in eqn. (1), to achieve its dynamic regret bound (see Theorem 4.3). This is similar to algorithms for the drifting bandit settings, which only require access to the total variation budget (Besbes et al. 2014). More importantly, the BORL algorithm (built upon the SWUCRL2CW algorithm) enjoys the same dynamic regret bound even without knowing either $B_r$ or $B_p$ (see Theorem 5.3).
For online learning and bandit problems, where there is only one state, the works by (Auer et al. 2002a, Garivier and Moulines 2011a, Besbes et al. 2014, Keskin and Zeevi 2016, Russac et al. 2019) proposed several “forgetting” strategies for different settings. More recently, the works by (Jadbabaie et al. 2015, Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019b, a, Chen et al. 2019b) designed parameter-free algorithms for nonstationary online learning.
4 Sliding Window UCRL2 with Confidence Widening
In this section, we present the SWUCRL2CW algorithm, which incorporates sliding window estimates (Garivier and Moulines 2011a) and a novel confidence widening technique into UCRL2 (Jaksch et al. 2010).
4.1 Design Overview
The SWUCRL2CW algorithm first specifies a sliding window parameter $W \in \mathbb{N}$ and a confidence widening parameter $\eta \ge 0$. Parameter $W$ specifies the number of previous time steps to look at. Parameter $\eta$ quantifies the amount of additional optimistic exploration, on top of the conventional optimistic exploration using upper confidence bounds. The latter turns out to be necessary for handling the drifting nonstationarity of the transition kernels.
The algorithm runs in a sequence of episodes that partitions the $T$ time steps. Episode $m$ starts at time $\tau(m)$ (in particular, $\tau(1) = 1$), and ends at the end of time $\tau(m+1) - 1$. Throughout an episode $m$, the DM follows a certain stationary policy $\tilde\pi_m$. The DM ceases the episode if at least one of the following two criteria is met:

The time index $t$ is a multiple of $W$. Consequently, each episode lasts for at most $W$ time steps. The criterion ensures that the DM switches the stationary policy frequently enough, in order to adapt to the nonstationarity of the $r_t$'s and $p_t$'s.

There exists some state-action pair $(s, a)$ such that the number of time steps $t$ with $(s_t, a_t) = (s, a)$ within episode $m$ is at least as many as the total number of counts for it within the $W$ time steps prior to $\tau(m)$, i.e., from $\tau(m) - W$ to $\tau(m) - 1$. This is similar to the doubling criterion in (Jaksch et al. 2010), which ensures that each episode is sufficiently long so that the DM can focus on learning.
The combined effect of these two criteria allows the DM to learn a low dynamic regret policy with historical data from an appropriately sized time window. One important ingredient is the construction of the policy $\tilde\pi_m$ for each episode $m$. To allow learning under nonstationarity, the SWUCRL2CW algorithm computes $\tilde\pi_m$ based on the history in the $W$ time steps previous to the current episode $m$, i.e., from round $(\tau(m) - W) \vee 1$ to round $\tau(m) - 1$. The construction of $\tilde\pi_m$ involves the Extended Value Iteration (EVI) (Jaksch et al. 2010), which requires the confidence regions for rewards and transition kernels as inputs, in addition to a precision parameter $\epsilon$. The confidence widening parameter $\eta$ ensures that the MDP output by the EVI has a bounded diameter most of the time.
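The two episode-termination criteria can be sketched as a single predicate. This is our own illustrative rendering; the handling of never-visited pairs follows the usual $\max\{1, \cdot\}$ convention of UCRL2-style algorithms:

```python
def should_end_episode(t, W, nu, N_prev):
    """Return True iff the current episode of SWUCRL2CW should end.

    t      : current time index (1-based)
    W      : sliding-window length
    nu     : dict mapping (s, a) -> visits within the current episode
    N_prev : dict mapping (s, a) -> visits in the W steps before the episode
    """
    # Criterion 1: the time index is a multiple of W.
    if t % W == 0:
        return True
    # Criterion 2 (doubling): some state-action pair has been visited in
    # this episode at least as often as in the W steps preceding it.
    return any(n >= max(1, N_prev.get(sa, 0)) for sa, n in nu.items())
```

Criterion 1 caps every episode at $W$ steps, while criterion 2 caps the amount of fresh data collected for any single pair before the policy is recomputed.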
4.2 Policy Construction
To describe the SWUCRL2CW algorithm, we first define, for each state-action pair $(s, a)$ and each time step $t$ in episode $m$, the number of visits in the previous $W$ time steps:
$$N_t(s,a) := \sum_{q = (\tau(m)-W)\vee 1}^{\tau(m)-1} \mathbf{1}\{(s_q, a_q) = (s,a)\}, \qquad N_t^{+}(s,a) := \max\{1, N_t(s,a)\}. \qquad (3)$$
Confidence Region for Rewards
For each state-action pair $(s, a)$ and each time step $t$ in episode $m$, we consider the empirical mean estimator
$$\hat r_t(s,a) := \frac{1}{N_t^{+}(s,a)} \sum_{q=(\tau(m)-W)\vee 1}^{\tau(m)-1} R_q(s_q, a_q)\,\mathbf{1}\{(s_q, a_q) = (s,a)\},$$
which serves to estimate the average reward
$$\bar r_t(s,a) := \frac{1}{N_t^{+}(s,a)} \sum_{q=(\tau(m)-W)\vee 1}^{\tau(m)-1} r_q(s,a)\,\mathbf{1}\{(s_q, a_q) = (s,a)\}.$$
The confidence region is defined as
$$H_{r,t}(s,a) := \big\{\dot r \in [0,1] : |\dot r - \hat r_t(s,a)| \le \mathrm{rad}_{r,t}(s,a)\big\}, \qquad (4)$$
with a confidence radius $\mathrm{rad}_{r,t}(s,a)$ on the order of $\sqrt{\log(SAT/\delta)/N_t^{+}(s,a)}$.
Confidence Widening for Transition Kernels.
For each state-action pair $(s, a)$ and each time step $t$ in episode $m$, we consider the empirical mean estimator
$$\hat p_t(\cdot \mid s,a) := \frac{1}{N_t^{+}(s,a)} \sum_{q=(\tau(m)-W)\vee 1}^{\tau(m)-1} \mathbf{1}\{(s_q, a_q, s_{q+1}) = (s, a, \cdot)\},$$
which serves to estimate the average transition distribution
$$\bar p_t(\cdot \mid s,a) := \frac{1}{N_t^{+}(s,a)} \sum_{q=(\tau(m)-W)\vee 1}^{\tau(m)-1} p_q(\cdot \mid s,a)\,\mathbf{1}\{(s_q, a_q) = (s,a)\}. \qquad (5)$$
Different from the case of estimating rewards, the confidence region for the transition probability involves a widening parameter $\eta \ge 0$:
$$H_{p,t}(s,a;\eta) := \big\{\dot p \in \Delta_{\mathcal{S}} : \|\dot p - \hat p_t(\cdot\mid s,a)\|_{1} \le \mathrm{rad}_{p,t}(s,a) + \eta\big\}, \qquad (6)$$
with a confidence radius $\mathrm{rad}_{p,t}(s,a)$ on the order of $\sqrt{S\log(SAT/\delta)/N_t^{+}(s,a)}$.
With $\eta > 0$, the DM can explore transition kernels that deviate from the sample average, and the exploration is crucial for learning MDPs under nonstationarity. In a nutshell, the incorporation of $\eta$ provides an additional source of optimism. We treat $\eta$ as a hyper-parameter at the moment, and provide a suitable choice of $\eta$ when we discuss our main results.
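Putting the estimators and the widening together, the per-pair computation can be sketched as below. This is our own rendering: the constants in the radii are illustrative (only the $1/\sqrt{N}$ and $\sqrt{S/N}$ scalings matter here), and `L` stands for the logarithmic term $\log(SAT/\delta)$:

```python
import numpy as np

def window_estimates(history, s, a, t, W, S, L, eta):
    """Sliding-window estimates and (widened) confidence radii for one
    state-action pair. `history[q] = (s_q, a_q, reward_q, next_state_q)`
    is 0-indexed by time; only steps in [t - W, t) are used."""
    lo = max(0, t - W)
    visits = [h for h in history[lo:t] if h[0] == s and h[1] == a]
    n_plus = max(1, len(visits))
    r_hat = sum(h[2] for h in visits) / n_plus      # empirical mean reward
    p_hat = np.zeros(S)                              # empirical transitions
    for h in visits:
        p_hat[h[3]] += 1.0 / n_plus
    rad_r = 2.0 * np.sqrt(2.0 * L / n_plus)          # ~ 1/sqrt(N) scaling
    rad_p = 2.0 * np.sqrt(2.0 * S * L / n_plus)      # ~ sqrt(S/N) scaling
    return r_hat, rad_r, p_hat, rad_p + eta          # widened L1 radius
```

The widening enters only through the returned transition radius: the L1 ball around `p_hat` is enlarged by exactly `eta`, independent of the visit counts.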
Extended Value Iteration (EVI) (Jaksch et al. 2010).
The SWUCRL2CW algorithm relies on the EVI, which solves MDPs with optimistic exploration to near-optimality. We extract and rephrase a description of EVI in Appendix A.3. EVI inputs the confidence regions $H_r = \{H_{r,t}(s,a)\}_{(s,a)}$ and $H_p = \{H_{p,t}(s,a;\eta)\}_{(s,a)}$ for the rewards and the transition kernels. The algorithm outputs an “optimistic MDP model”, which consists of a reward vector $\tilde r$ and a transition kernel $\tilde p$ under which the optimal average gain $\tilde\rho$ is the largest among all $\dot r \in H_r$ and $\dot p \in H_p$:

Input: Confidence regions $H_r$ for $r$, $H_p$ for $p$, and an error parameter $\epsilon > 0$.

Output: The returned policy $\tilde\pi$ and the auxiliary output $(\tilde r, \tilde p, \tilde\rho, \tilde\gamma)$. In the latter, $\tilde r$, $\tilde p$, and $\tilde\rho$ are the selected “optimistic” reward vector, transition kernel, and the corresponding long-term average reward. The output $\tilde\gamma \in \mathbb{R}^{S}_{\ge 0}$ is a bias vector (Jaksch et al. 2010). For each $s \in \mathcal{S}$, the quantity $\tilde\gamma(s)$ is indicative of the short-term reward when the DM starts at state $s$ and follows the optimal policy. By the design of EVI, for the output $\tilde\gamma$, there exists $s' \in \mathcal{S}$ such that $\tilde\gamma(s') = 0$. Altogether, we express $\text{EVI}(H_r, H_p; \epsilon) = (\tilde\pi, (\tilde r, \tilde p, \tilde\rho, \tilde\gamma))$.
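The key subroutine inside each EVI iteration is the inner maximization over the transition confidence ball: given the current bias (value) vector, pick the most favorable kernel within the L1 ball. A sketch of the standard routine from (Jaksch et al. 2010) follows; the widened region of SWUCRL2CW enters simply through a larger radius:

```python
import numpy as np

def optimistic_transition(p_hat, radius, value):
    """Maximize sum_s p[s] * value[s] over the L1 ball of the given
    radius around p_hat (intersected with the simplex): put extra mass
    on the highest-value state, then drain the surplus from the
    lowest-value states first."""
    p = p_hat.astype(float).copy()
    best = int(np.argmax(value))
    p[best] = min(1.0, p_hat[best] + radius / 2.0)  # add up to radius/2 of mass
    for s in np.argsort(value):                     # worst states first
        if p.sum() <= 1.0 + 1e-12:
            break
        if s != best:
            p[s] -= min(p[s], p.sum() - 1.0)        # drain the surplus
    return p
```

Because the routine only needs a radius, widening by $\eta$ requires no structural change: EVI is simply called with `radius = rad_p + eta`.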
Combining the three components, a formal description of the SWUCRL2CW algorithm is shown in Algorithm 1.
4.3 Performance Analysis: The Blessing of More Optimism
We now analyze the performance of the SWUCRL2CW algorithm. First, we introduce two events $\mathcal{E}_r$ and $\mathcal{E}_p$, which state that the estimated rewards and transition kernels lie in the respective (unwidened) confidence regions.
We prove that $\mathcal{E}_r$ and $\mathcal{E}_p$ hold with high probability. {lemma} We have $\Pr[\mathcal{E}_r] \ge 1 - \delta/2$ and $\Pr[\mathcal{E}_p] \ge 1 - \delta/2$. The proof is in Section B of the appendix. In defining $\mathcal{E}_p$, the widening parameter is set to $\eta = 0$, since we are only concerned with the estimation error on $p$. Next, we bound the dynamic regret of each time step, under certain assumptions on the widened confidence regions. To facilitate our discussion, we define the following variation measure for each $t$ in an episode $m$:
{proposition} Consider an episode $m$. Condition on the events $\mathcal{E}_r$ and $\mathcal{E}_p$, and suppose that there exists a transition kernel $\check p$ satisfying two properties: (1) we have $\check p(\cdot\mid s, a) \in H_{p,\tau(m)}(s, a; \eta)$ for every state-action pair $(s, a)$, and (2) the diameter of $(\mathcal{S}, \mathcal{A}, \check p)$ is at most $\check D$. Then, for every $t$ in episode $m$, we have
(7)  
(8) 
The complete proof can be found in Section C of the appendix. Unlike Lemma 4.3, the parameter $\eta$ plays an important role in the Proposition. As $\eta$ increases, the confidence region $H_{p,\tau(m)}(s, a; \eta)$ becomes larger for each $(s, a)$, and the diameter $\check D$ assumed in the Proposition is expected to decrease. Our subsequent analysis shows that $\eta$ can be suitably calibrated so that $\check D = O(D_{\max})$. Next, we state our first main result, which provides a dynamic regret bound assuming the knowledge of $B_r$ and $B_p$ to set $W$ and $\eta$: {theorem} With the window size $W$ and the confidence widening parameter $\eta$ chosen as functions of $B_r$, $B_p$, $S$, $A$, and $T$, the SWUCRL2CW algorithm satisfies, with probability $1 - O(\delta)$, the dynamic regret bound
$$\widetilde{O}\big(D_{\max}(B_r + B_p)^{1/4} S^{2/3} A^{1/2} T^{3/4}\big).$$
Proof sketch. The complete proof is presented in Section D of the appendix. To facilitate the exposition, we denote $M$ as the total number of episodes. By abusing notation, we let $\tau(M+1) = T+1$; episode $M$ is interrupted, and the algorithm is forced to terminate, as the end of time $T$ is reached. To proceed, we define the set of episodes whose widened confidence regions fail to contain a transition kernel of bounded diameter.
For each episode , we distinguish two cases:

Case 1: episode $m$ lies outside the set. Under this situation, we apply Proposition 4.3 to bound the dynamic regret during the episode, using the fact that episode $m$ satisfies the assumptions of the Proposition.

Case 2: episode $m$ belongs to the set. In this case, we trivially upper bound the dynamic regret of each round in episode $m$ by 1, since the rewards lie in $[0, 1]$.
For case 1, we bound the dynamic regret during episode $m$ by summing the error terms in (7, 8) across the rounds in the episode. The term (7) accounts for the error incurred by switching policies. In (8), the first terms account for the estimation errors due to stochastic variations, and the remaining term accounts for the estimation error due to nonstationarity.
For case 2, we need an upper bound on the total number of rounds that belong to an episode in the set. The analysis is challenging, since the length of each episode may vary, and one can only guarantee that each length is at most $W$. A first attempt could be to multiply the number of such episodes by $W$, but the resulting bound appears too loose to provide any meaningful regret bound. Indeed, there could be double counting, as the starting time steps of a pair of episodes in case 2 might not even be $W$ rounds apart!
To avoid the trap of double counting, we consider a subset of the case-2 episodes whose start times are sufficiently far apart, and relate its cardinality to the total number of case-2 rounds. The subset is constructed sequentially, by examining all case-2 episodes in time order: starting from an empty subset, an episode is added whenever its start time is sufficiently separated from the start times of the episodes already included, and otherwise we move on to the next episode index. The process terminates once we arrive at episode $M$. The construction ensures that every case-2 episode either belongs to the subset, or starts within the window length of an episode that does.
By virtue of the confidence widening, we argue that every episode in this subset consumes a considerable amount of the variation budget $B_p$. It turns out that we can thereby upper bound the cardinality of the subset, and hence also the total number of case-2 rounds. ∎
Proposition 4.3 states that if the confidence region contains a transition kernel that induces an MDP with bounded diameter, the EVI supplied with this region can return a policy with a controllable dynamic regret bound. However, as we show in Section 6, one in general cannot expect this to happen. Nevertheless, we bypass this difficulty with our novel confidence widening technique and a budget-aware analysis. We consider the first time step of each episode: if the widened confidence region contains such a kernel, then Proposition 4.3 can be leveraged; otherwise, the widened confidence region enforces that a considerable amount of variation budget is consumed. {remark} When $S = 1$, our problem becomes the nonstationary bandit problem studied by (Besbes et al. 2014). By choosing $W$ suitably, our algorithm attains a dynamic regret matching the minimax optimal dynamic regret bound by (Besbes et al. 2014). {remark} Similar to (Cheung et al. 2019b, a), if $B_r$ and $B_p$ are not known, we can set $W$ and $\eta$ obliviously to obtain a dynamic regret bound that degrades gracefully with the actual variations.
5 Bandit-over-Reinforcement Learning: Towards Parameter-Free
As pointed out in Remark 4.3, in the case of unknown $B_r$ and $B_p$, the dynamic regret of the SWUCRL2CW algorithm scales linearly in $B_r$ and $B_p$. However, by Theorem 4.3, we are assured that a fixed pair of parameters $(W^*, \eta^*)$ can ensure low dynamic regret. For the bandit setting, (Cheung et al. 2019a, b) propose the bandit-over-bandit framework that uses a separate copy of the EXP3 algorithm to tune the window size. Inspired by it, we develop a novel Bandit-over-Reinforcement Learning (BORL) algorithm with a parameter-free dynamic regret bound here.
5.1 Design Overview
Following a similar line of reasoning as (Cheung et al. 2019a), we make use of the SWUCRL2CW algorithm as a subroutine, and “hedge” (Bubeck and Cesa-Bianchi 2012) against the (possibly adversarial) changes of the $r_t$'s and $p_t$'s to identify a reasonable fixed window size and confidence widening parameter.
As illustrated in Fig. 1, the BORL algorithm divides the whole time horizon into blocks of equal length $H$ rounds (the length of the last block can be smaller than $H$), and specifies a set $J$ from which each pair of (window size, confidence widening) parameters is drawn. For each block $i$, the BORL algorithm first calls a master algorithm to select a pair of (window size, confidence widening) parameters $(W_i, \eta_i) \in J$, and restarts the SWUCRL2CW algorithm with the selected parameters as a subroutine to choose actions for this block. Afterwards, the total reward of block $i$ is fed back to the master, and the “posterior” of these parameters is updated accordingly.
One immediate challenge not present in the bandit setting (Cheung et al. 2019b) is that the starting state of each block is determined by the previous moves of the DM. Hence, the master algorithm is not facing a simple oblivious environment, as is the case in (Cheung et al. 2019b), and we cannot use the EXP3 (Auer et al. 2002a) algorithm as the master. Fortunately, the state is observed before the start of a block. Thus, we use the EXP3.P algorithm for multi-armed bandits against an adaptive adversary (Auer et al. 2002a, Bubeck and Cesa-Bianchi 2012) as the master algorithm. We follow the exposition in Section 3.2 of (Bubeck and Cesa-Bianchi 2012) for adapting the EXP3.P algorithm.
5.2 Design Details
We are now ready to state the details of the BORL algorithm. For some fixed choice of block length $H$ (to be determined later), we first define some additional notation:
(9)  
Here, $J_W$ and $J_\eta$ are the sets of all possible choices of window size and confidence widening parameter, respectively, and $J = J_W \times J_\eta$ is their Cartesian product, with $|J| = |J_W|\cdot|J_\eta|$. We also write $Q_i(s, W, \eta)$ for the total rewards of running the SWUCRL2CW algorithm with window size $W$ and confidence widening parameter $\eta$ for block $i$, starting from state $s$.
The EXP3.P algorithm treats each element of $J$ as an arm. It begins by initializing
(10) 
At the beginning of each block $i$, the BORL algorithm first sees the state $s_{(i-1)H+1}$, and computes
(11) 
Then it selects the $j$-th element of $J$ with probability equal to the $j$-th entry of the distribution computed in (11); the selected pair of parameters is denoted $(W_i, \eta_i)$. Afterwards, the BORL algorithm starts from state $s_{(i-1)H+1}$, and selects actions by running the SWUCRL2CW algorithm with window size $W_i$ and confidence widening parameter $\eta_i$ for each round in block $i$. At the end of the block, the BORL algorithm observes the total rewards. As a last step, it rescales the total rewards by dividing by $H$, so that the result lies within $[0, 1]$, and updates
(12) 
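For intuition, one block-level round of the EXP3.P master can be sketched as below. This is our own simplified rendering of the exposition in Section 3.2 of Bubeck and Cesa-Bianchi (2012); the exact constants and parameter couplings used in the paper differ:

```python
import numpy as np

def exp3p_probs(w, gamma):
    """Sampling distribution over the arms (elements of J): normalized
    weights mixed with uniform exploration, in the spirit of (11)."""
    K = len(w)
    return (1.0 - gamma) * w / w.sum() + gamma / K

def exp3p_update(w, p, chosen, reward, beta, lr):
    """Weight update from a block reward rescaled into [0, 1], using the
    optimistic importance-weighted gain estimate of EXP3.P, in the
    spirit of (12)."""
    g_hat = beta / p                      # optimism added for every arm
    g_hat[chosen] += reward / p[chosen]   # importance-weighted observed gain
    return w * np.exp(lr * g_hat)
```

The `beta / p` term is what distinguishes EXP3.P from plain EXP3: it biases the gain estimates upward so that high-probability regret bounds hold even against an adaptive adversary, which is exactly the situation the BORL master faces.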
The formal description of the BORL algorithm (with its parameters defined in the next subsection) is shown in Algorithm 2.
5.3 Performance Analysis
The dynamic regret guarantee of the BORL algorithm is as follows. {theorem} With probability $1 - O(\delta)$, the dynamic regret bound of the BORL algorithm is $\widetilde{O}\big(D_{\max}(B_r + B_p)^{1/4} S^{2/3} A^{1/2} T^{3/4}\big)$. The proof is provided in Section E of the appendix.
6 The Perils of Drift in Learning Markov Decision Processes
In online stochastic environments, one usually estimates a latent quantity by taking the time average of observed samples, even when the sample distribution varies across time. This has been proved to work well in nonstationary bandit settings (Garivier and Moulines 2011a, Cheung et al. 2019a, b). To extend this to RL, it is natural to consider the sample average transition kernel $\hat p_t(\cdot\mid s, a)$, which uses the data in the previous $W$ rounds to estimate the time average transition kernel $\bar p_t(\cdot\mid s, a)$ to within an additive error (see Section 4.2.3 and Lemma 4.3). In the case of stationary MDPs, where $p_1 = \cdots = p_T = p$, one has $\bar p_t(\cdot\mid s, a) = p(\cdot\mid s, a)$. Thus, the unwidened confidence region $H_{p,t}(s, a; 0)$ contains $p(\cdot\mid s, a)$ with high probability. Consequently, the UCRL2 algorithm by (Jaksch et al. 2010), which optimistically explores the unwidened confidence region, has a regret that scales linearly with the diameter of $p$.
The approach of optimistically exploring unwidened confidence regions is further extended to RL in piecewise-stationary MDPs by (Jaksch et al. 2010, Gajane et al. 2018). The latter establishes dynamic regret bounds when there are at most a given number of changes. Their analyses involve partitioning the $T$-round horizon into equal-length intervals, whose common length is a constant dependent on the number of changes. All but a bounded number of the intervals enjoy stationary environments, and optimistic exploration in these intervals yields a dynamic regret bound that scales linearly with the diameter. Bounding the dynamic regret of the remaining intervals by their lengths, and tuning the interval length, yield the desired bound.
In contrast to the stationary and piecewise-stationary settings, optimistic exploration of the unwidened confidence regions might lead to unfavorable dynamic regret bounds in drifting MDPs. In the drifting environment, where $p_1, \dots, p_T$ are generally distinct, we show that it is impossible to bound the diameter of the MDP induced by $\bar p_t$ in terms of the maximum of the diameters of the MDPs induced by $p_1, \dots, p_T$. More generally, we demonstrate the previous claim not only for $\bar p_t$, but also for every