Drifting Reinforcement Learning: The Blessing of (More) Optimism in Face of Endogenous & Exogenous Dynamics


Wang Chi Cheung, Department of Industrial Systems Engineering and Management, National University of Singapore, isecwc@nus.edu.sg

David Simchi-Levi, Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu

Ruihao Zhu, Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139


We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under temporal drifts, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. This setting captures the endogenous and exogenous dynamics, uncertainty, and partial feedback in sequential decision-making scenarios, and finds applications in various online marketplaces, such as vehicle remarketing in used-car sales, real-time bidding in advertisement auctions, and dynamic pricing in ride-sharing platforms. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Finally, we conduct numerical experiments to show that our proposed algorithms achieve superior empirical performance compared to existing algorithms.

Notably, the interplay between endogenous and exogenous dynamics presents a unique challenge, absent in existing (stationary and non-stationary) stochastic online learning settings, when we apply the conventional Optimism in Face of Uncertainty (OFU) principle to design algorithms with provably low dynamic regret for RL in drifting MDPs. We overcome the challenge by a novel confidence widening technique that incorporates additional optimism into our learning algorithms to ensure low dynamic regret bounds. To extend our theoretical findings, we apply our framework to inventory control problems, and demonstrate how one can alternatively leverage special structures on the state transition distributions to bypass the difficulty in exploring time-varying environments.


drifting reinforcement learning, revenue management, online marketplaces, data-driven decision making, confidence widening

1 Introduction

Consider a sequential decision-making framework, where a decision-maker (DM) interacts with a discrete time Markov decision process (MDP) iteratively. At each time step, the DM first observes the current state of the MDP, and then chooses an available action. After that, she receives an instantaneous random reward, and the MDP transits to the next state. The DM aims to design a policy that maximizes its cumulative rewards, while facing the following challenges:

  1. Endogenous dynamics: At each time step, the reward follows a reward distribution, and the subsequent state follows a state transition distribution. Both distributions depend on the current state and action, which are influenced by the policy.

  2. Exogenous dynamics: The reward and state transition distributions vary (independently of the policy) across time steps, but the total variations are bounded by the respective variation budgets.

  3. Uncertainty: Both the reward and state transition distributions are initially unknown to the DM.

  4. Bandit/Partial feedback: In each time step, the DM can only observe the reward and state transition resulting from the current state and action.

It turns out that many applications, such as vehicle remarketing in used-car sales and real-time bidding in advertisement (ad) auctions, can be captured by this framework.

  • Vehicle remarketing in used-car sales: An automobile company disposes of continually arriving off-lease vehicles (i.e., leasing vehicles that have reached the end of their fixed term) via daily wholesale vehicle auctions (Manheim 2020, Vehicle Remarketing 2020). At the beginning of each auction, the company decides the number of off-lease vehicles to be listed, and then the car dealers bid for the purchases via a first-price auction. The sales of vehicles generate revenue to the company while unsold vehicles incur holding cost to the company. The company aims at maximizing profit by designing a policy that dynamically decides the vehicles to be listed in each auction. However, the dealers’ bidding behaviors are affected by many unpredictable (and thus exogenous) factors (e.g., real-time customer demands, vehicles’ depreciation, and inter-dealer competitions) in addition to the company’s decisions (i.e., the vehicles listed), and can vary across time.

  • Real-time bidding in ad auctions: Advertisers repeatedly compete for ad display impressions via real-time online auctions (Google 2011). Each advertiser begins with a (periodically refilled) budget. Upon the arrival of a user, an impression is generated, and the advertisers submit bids for it. Then, the winning advertiser makes the payment (determined by the auction mechanism) using her remaining budget, and displays her ad to the user. Finally, she observes the user feedback (Cai et al. 2017, Flajolet and Jaillet 2017, Balseiro and Gur 2019, Choi and Sayedi 2019, Guo et al. 2019). Each advertiser wants to maximize the number of clicks on her advertisement. Nevertheless, the competitiveness of each auction exhibits exogeneity, as the participating advertisers and the arriving users differ from time to time. Moreover, the popularity of an ad can change for endogenous reasons. For instance, displaying the same ad too frequently in a short period of time might reduce its freshness, resulting in a temporarily low number of clicks (i.e., we can incorporate both the remaining budget and the number of times the ad has been shown within a given window into the state of the MDP to model endogenous dynamics).

Besides, this framework can be used to model sequential decision-making problems in ride-sharing (Banerjee et al. 2015, Gurvich et al. 2018, Taylor 2018, Bimpikis et al. 2019, Kanoria and Qian 2019), transportation (Zhang and Wang 2018, Qin et al. 2019), healthcare operations (Shortreed et al. 2010, Yu et al. 2019), and inventory control (Besbes and Muharremoglu 2013, Bertsekas 2017, Zhang et al. 2018, Agrawal and Jia 2019, Chen et al. 2019a).

There exist numerous works in sequential decision-making that consider a subset of the four challenges. The traditional stream of research (Auer et al. 2002b, Bubeck and Cesa-Bianchi 2012, Lattimore and Szepesvári 2018) on stochastic multi-armed bandits (MAB) focuses on the interplay between uncertainty and bandit feedback (i.e., challenges 3 and 4), and (Auer et al. 2002b) propose the classical Upper Confidence Bound (UCB) algorithm. Starting from (Burnetas and Katehakis 1997, Tewari and Bartlett 2008, Jaksch et al. 2010), a volume of works (see Section 3) has been devoted to reinforcement learning (RL) (Sutton and Barto 2018) in stationary MDPs, which further involves endogenous dynamics. Stationary MDPs incorporate challenges 1, 3, and 4, and stochastic MAB is a special case of online MDPs with only one state. In the absence of exogenous dynamics, the reward and state transition distributions are invariant across time, and these three challenges can be jointly solved by the Upper Confidence bound for Reinforcement Learning (UCRL2) algorithm (Jaksch et al. 2010).
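As a concrete illustration of the OFU principle behind UCB, here is a minimal UCB1 sketch for Bernoulli-reward bandits. The function name and the Bernoulli reward model are our illustrative assumptions, not the algorithms analyzed in this paper:

```python
import math, random

def ucb1(arm_means, horizon, seed=0):
    """Generic UCB1 sketch: pull the arm maximizing empirical mean + bonus."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play each arm once to initialize
        else:
            # optimism in face of uncertainty: maximize the upper confidence bound
            arm = max(range(k),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

On a stationary instance the bonus shrinks for frequently pulled arms, so play concentrates on the best arm; this is exactly the property that breaks down once distributions drift.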

The UCB and UCRL2 algorithms leverage the optimism in face of uncertainty (OFU) principle to select actions iteratively based on the entire collection of historical data. However, both algorithms quickly deteriorate when exogenous dynamics emerge, since the historical data become obsolete. To address the challenge of exogenous dynamics, there is a recent line of research initiated by (Besbes et al. 2014) that studies drifting bandit environments, in which the reward distributions can change arbitrarily and independently of the actions chosen over time, but the total change (quantified by a suitable metric) is upper bounded by a variation budget (Besbes et al. 2014). The aim is to minimize the dynamic regret, the optimality gap compared to the cumulative rewards of the sequence of optimal actions. The drifting bandit setting addresses the challenges of uncertainty, partial feedback, and exogenous dynamics (i.e., challenges 2, 3, and 4), but endogenous dynamics (challenge 1) are absent. In (Jaksch et al. 2010), the authors also consider RL in piecewise-stationary MDPs. Nevertheless, we show in Section 6 that simply adapting the techniques for drifting bandits or switching MDPs to the setting of drifting MDPs can result in poor dynamic regret bounds.
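The "forgetting" idea that drives drifting-bandit algorithms can be sketched as a sliding-window estimator: only the most recent observations enter the empirical mean, so samples from a drifted distribution are eventually discarded. A minimal sketch (the window size is a tunable assumption):

```python
from collections import deque

def sliding_window_mean(window_size):
    """Sliding-window ('forgetting') estimator sketch: only the last
    `window_size` observations enter the empirical mean."""
    buf = deque(maxlen=window_size)  # old samples fall off automatically

    def update(x):
        buf.append(x)
        return sum(buf) / len(buf)

    return update
```

A full-history average would be biased by stale pre-drift samples; the window trades a little variance for robustness to drift.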

In this paper, we consider RL in drifting MDPs, i.e., sequential decision-making under all four of the above-mentioned challenges. We assume that, during the $T$ time steps, the total variations of the reward and state transition distributions are bounded (under suitable metrics) by the variation budgets $B_r$ and $B_p$, respectively. We design and analyze novel algorithms for RL in drifting MDPs. Let $D_{\max}$, $S$, and $A$ be the maximum diameter (a complexity measure to be defined in Section 2), the number of states, and the number of actions in the MDP. Our main contributions are:

  • We develop the Sliding Window UCRL2 with Confidence Widening (SWUCRL2-CW) algorithm. When the variation budgets are known, we prove via a budget-aware analysis that it attains a $\widetilde{O}\big(D_{\max}(B_r + B_p)^{1/4} S^{2/3} A^{1/2} T^{3/4}\big)$ dynamic regret bound.

  • We propose the Bandit-over-Reinforcement Learning (BORL) algorithm that tunes the SWUCRL2-CW algorithm adaptively, and retains the same dynamic regret bound without knowing the variation budgets.

  • We identify an unprecedented challenge for RL in drifting MDPs with optimistic exploration: existing algorithmic frameworks for non-stationary online learning (Jaksch et al. 2010, Cheung et al. 2019b) typically estimate unknown parameters by averaging historical data in a “forgetting” fashion, and construct the tightest possible confidence regions accordingly. They then optimistically search for the most favorable model within the confidence regions, and execute the corresponding optimal policy. If the DM follows this guideline, she would repeatedly execute an updated optimal policy on an “optimistic” MDP when learning drifting MDPs. However, the diameters of the optimistic MDPs constructed in this manner can grow wildly, which may result in unfavorable dynamic regret bounds. We overcome this with our novel proposal of extra optimism via the confidence widening technique.

  • As a complement to this finding, we show that if, for every pair of initial state and target state, there always exists an action such that the probability of transitioning from the initial state to the target state by taking this action is bounded away from zero over the entire time horizon, then the DM can attain low dynamic regret without widening the confidence regions. We demonstrate that, in the context of inventory control, a mild condition on the demand distribution is sufficient for this extra assumption to hold.

The rest of the paper is organized as follows: in Section 2, we describe the non-stationary MDP model of interest. In Section 3, we review related works in non-stationary online learning and reinforcement learning. In Section 4, we introduce the SWUCRL2-CW algorithm, and analyze its performance in terms of dynamic regret. In Section 5, we design the BORL algorithm that attains the same dynamic regret bound as the SWUCRL2-CW algorithm without knowing the total variations. In Section 6, we discuss the challenges in designing learning algorithms for reinforcement learning under drift, and show how the novel confidence widening technique mitigates this issue. In Section 7, we discuss the alternative approach that does not widen the confidence regions. In Section 8, we conduct numerical experiments to show the superior empirical performance of our algorithms. In Section 9, we conclude our paper.

2 Problem Formulation

An instance of non-stationary online MDP is specified by the tuple $(\mathcal{S}, \{\mathcal{A}_s\}_{s \in \mathcal{S}}, T, r, p)$. The set $\mathcal{S}$ is a finite set of states. The collection $\{\mathcal{A}_s\}_{s \in \mathcal{S}}$ contains a finite action set $\mathcal{A}_s$ for each state $s \in \mathcal{S}$. We say that $(s, a)$ is a state-action pair if $s \in \mathcal{S}$ and $a \in \mathcal{A}_s$. We denote $S = |\mathcal{S}|$ and $A = \max_{s \in \mathcal{S}} |\mathcal{A}_s|$. We denote $T$ as the total number of time steps, and denote $r = \{r_t\}_{t=1}^{T}$ as the sequence of mean rewards. For each $t$, we have $r_t = \{r_t(s, a)\}_{s \in \mathcal{S}, a \in \mathcal{A}_s}$, and $r_t(s, a) \in [0, 1]$ for each state-action pair $(s, a)$. In addition, we denote $p = \{p_t\}_{t=1}^{T}$ as the sequence of transition kernels. For each $t$, we have $p_t = \{p_t(\cdot \mid s, a)\}_{s \in \mathcal{S}, a \in \mathcal{A}_s}$, where $p_t(\cdot \mid s, a)$ is a probability distribution over $\mathcal{S}$ for each state-action pair $(s, a)$.
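For concreteness, one step of these dynamics can be simulated as follows. This is a sketch: the dict-based encoding of the time-$t$ rewards and kernel, and the Bernoulli reward standing in for a general 1-sub-Gaussian reward, are our own assumptions:

```python
import random

def mdp_step(p_t, r_t, s, a, rng):
    """One step of the non-stationary MDP: draw the next state from
    p_t(.|s,a) and a Bernoulli reward with mean r_t(s,a).
    p_t maps (s, a) -> probability vector over states; r_t maps (s, a) -> mean."""
    next_s = rng.choices(range(len(p_t[(s, a)])), weights=p_t[(s, a)])[0]
    reward = 1.0 if rng.random() < r_t[(s, a)] else 0.0
    return next_s, reward
```

In a drifting instance, a fresh `p_t` and `r_t` would be supplied at every time step $t$.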

The quantities $r_t$ and $p_t$ vary across different $t$'s in general. Following (Besbes et al. 2014), we quantify the variations on $r$ and $p$ in terms of their respective variation budgets $B_r$ and $B_p$:

$$B_r = \sum_{t=1}^{T-1} B_{r,t}, \qquad B_p = \sum_{t=1}^{T-1} B_{p,t}, \tag{1}$$

where $B_{r,t} = \max_{s \in \mathcal{S}, a \in \mathcal{A}_s} |r_{t+1}(s, a) - r_t(s, a)|$ and $B_{p,t} = \max_{s \in \mathcal{S}, a \in \mathcal{A}_s} \| p_{t+1}(\cdot \mid s, a) - p_t(\cdot \mid s, a) \|_1$.

We emphasize that although $B_r$ and $B_p$ might be used as inputs by the DM, the individual $B_{r,t}$'s and $B_{p,t}$'s are unknown to the DM throughout the current paper.
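The variation budgets just defined can be computed from the reward and kernel sequences as follows (a sketch; the dict/list encoding of the sequences is our own assumption):

```python
def variation_budgets(rewards, kernels):
    """Compute the variation budgets B_r, B_p of a non-stationary instance.
    rewards: list over t of dicts (s, a) -> mean reward
    kernels: list over t of dicts (s, a) -> probability vector over states"""
    # B_r: sum over t of the max absolute change in mean reward
    B_r = sum(max(abs(r2[sa] - r1[sa]) for sa in r1)
              for r1, r2 in zip(rewards, rewards[1:]))
    # B_p: sum over t of the max L1 change in the transition distribution
    B_p = sum(max(sum(abs(q2 - q1) for q1, q2 in zip(p1[sa], p2[sa]))
                  for sa in p1)
              for p1, p2 in zip(kernels, kernels[1:]))
    return B_r, B_p
```

Note that the max over state-action pairs is taken separately at every $t$, so the budgets charge only the worst-case per-step drift.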

Dynamics. The DM faces an online non-stationary MDP instance $(\mathcal{S}, \{\mathcal{A}_s\}_{s \in \mathcal{S}}, T, r, p)$. She knows $\mathcal{S}$, $\{\mathcal{A}_s\}_{s \in \mathcal{S}}$, and $T$, but not $r$ or $p$. The DM starts at an arbitrary state $s_1 \in \mathcal{S}$. At time $t$, three events happen. First, the DM observes its current state $s_t$. Second, she takes an action $a_t \in \mathcal{A}_{s_t}$. Third, given $s_t, a_t$, she stochastically transits to another state $s_{t+1}$, which is distributed as $p_t(\cdot \mid s_t, a_t)$, and receives a stochastic reward $R_t(s_t, a_t)$, which is 1-sub-Gaussian with mean $r_t(s_t, a_t)$. In the second event, the choice of $a_t$ is based on a non-anticipatory policy $\Pi$. That is, the choice depends only on the current state $s_t$ and the previous observations $\mathcal{H}_{t-1} := \{s_q, a_q, R_q(s_q, a_q)\}_{q=1}^{t-1}$.

Dynamic Regret. The DM aims to maximize the cumulative expected reward $\mathbb{E}[\sum_{t=1}^{T} r_t(s_t, a_t)]$, despite the model uncertainty on $r, p$ and the non-stationarity of the learning environment. To measure the convergence to optimality, we consider an equivalent objective of minimizing the dynamic regret (Besbes et al. 2014, Jaksch et al. 2010)

$$\text{Dyn-Reg}_T(\Pi) = \sum_{t=1}^{T} \rho_t^* \; - \; \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big]. \tag{2}$$

In the offline benchmark $\sum_{t=1}^{T} \rho_t^*$, the summand $\rho_t^*$ is the optimal long-term average reward of the stationary MDP with transition kernel $p_t$ and mean reward $r_t$. The optimum $\rho_t^*$ can be computed by solving the linear program (13) provided in Section A.1. {remark} When $S = 1$, (2) reduces to the definition (Besbes et al. 2014) of dynamic regret for non-stationary $A$-armed bandits. Nevertheless, different from the bandit case, the offline benchmark $\sum_{t=1}^{T} \rho_t^*$ does not in general equal the expected optimum for the non-stationary MDP problem.

Next, we review relevant concepts on MDPs, in order to stipulate an assumption that ensures learnability and justifies our offline benchmark. {definition}[Communicating MDPs and Diameters (Jaksch et al. 2010)] Consider a set of states $\mathcal{S}$, a collection $\{\mathcal{A}_s\}_{s \in \mathcal{S}}$ of action sets, and a transition kernel $\bar{p}$. For any $s, s' \in \mathcal{S}$ and any stationary policy $\pi$, the hitting time from $s$ to $s'$ under $\pi$ is the random variable $\Lambda(s' \mid \pi, s)$, which can be infinite. We say that $(\mathcal{S}, \{\mathcal{A}_s\}_{s \in \mathcal{S}}, \bar{p})$ is a communicating MDP iff

$$D := \max_{s, s' \in \mathcal{S}} \min_{\pi} \mathbb{E}\big[\Lambda(s' \mid \pi, s)\big]$$

is finite. The quantity $D$ is the diameter associated with $(\mathcal{S}, \{\mathcal{A}_s\}_{s \in \mathcal{S}}, \bar{p})$. Throughout the paper, we make the following assumption. {assumption} For each $t \in \{1, \ldots, T\}$, the tuple $(\mathcal{S}, \{\mathcal{A}_s\}_{s \in \mathcal{S}}, p_t)$ constitutes a communicating MDP with diameter at most $D_t$. We denote the maximum diameter as $D_{\max} = \max_{t \in \{1, \ldots, T\}} D_t$.
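The diameter can be approximated numerically: for each target state, the minimal expected hitting times satisfy stochastic-shortest-path Bellman equations, which we solve by value iteration. This is our own sketch under a simple list-based encoding of the kernel:

```python
def diameter(P, iters=2000, tol=1e-9):
    """Approximate the diameter: max over ordered pairs (s, s') of the minimal
    expected hitting time from s to s'. For each target, value-iterate on
    h(s) = min_a [1 + sum_x P[s][a][x] * h(x)], with h(target) fixed to 0.
    P[s][a] is a probability vector over next states."""
    n = len(P)
    best = 0.0
    for target in range(n):
        h = [0.0] * n
        for _ in range(iters):
            new = [0.0] * n
            for s in range(n):
                if s == target:
                    continue  # hitting time from the target to itself is 0
                new[s] = min(1.0 + sum(p * h[x] for x, p in enumerate(row))
                             for row in P[s])
            if max(abs(u - v) for u, v in zip(new, h)) < tol:
                h = new
                break
            h = new
        best = max(best, max(h))
    return best
```

For a two-state deterministic cycle the diameter is 1; adding a self-loop with probability 1/2 at one state doubles the expected hitting time out of it.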

The following proposition justifies our choice of offline benchmark . {proposition} Consider an instance that satisfies Assumption 2 with maximum diameter , and has variation budgets for the rewards and transition kernels respectively. In addition, suppose that . It holds that

The maximum is taken over all non-anticipatory policies $\Pi$. We denote by $\{s_t^\Pi, a_t^\Pi\}_{t=1}^{T}$ the trajectory under policy $\Pi$, where $a_t^\Pi$ is determined based on $\mathcal{H}_{t-1}$ and $s_t^\Pi$, and $s_{t+1}^\Pi \sim p_t(\cdot \mid s_t^\Pi, a_t^\Pi)$ for each $t$. The Proposition is proved in Appendix A.2. In fact, our dynamic regret bounds are larger than the error term, thus justifying the choice of $\sum_{t=1}^{T} \rho_t^*$ as the offline benchmark. The offline benchmark is more convenient for analysis than the expected optimum, since the former decomposes into summations across different intervals, unlike the latter, whose summands are intertwined (since $s_{t+1}^\Pi \sim p_t(\cdot \mid s_t^\Pi, a_t^\Pi)$).

3 Related Works

Learning stationary un-discounted MDPs has been studied in (Burnetas and Katehakis 1997, Bartlett and Tewari 2009, Jaksch et al. 2010, Agrawal and Jia 2017, Fruit et al. 2018a, b). For learning non-stationary MDPs, the stream of works (Dick et al. 2014, Cardoso et al. 2019) considered changing reward distributions but fixed transition kernels. (Yu and Mannor 2009, Yu et al. 2009) allowed arbitrary changes in reward, but bounded changes in the transition kernels, and designed algorithms under Markov chain mixing assumptions. (Jaksch et al. 2010, Gajane et al. 2018) proposed solutions for the piecewise-stationary setting. (Even-Dar et al. 2005, Abbasi-Yadkori et al. 2013, Li et al. 2019) considered learning MDPs with full information feedback in various adversarial and non-stationary environments. Episodic MDPs with adversarially changing rewards and stationary transition kernels are studied under full information feedback (Neu et al. 2012, Rosenberg and Mansour 2019a) and bandit feedback (Neu et al. 2010b, a, Arora et al. 2012, Rosenberg and Mansour 2019b, Jin et al. 2019). In (Nilim and Ghaoui 2005, Xu and Mannor 2006), robust control of non-stationary MDPs was studied.

In a parallel work, (Ortner et al. 2019) considered a similar setting to ours by applying the “forgetting principle” from non-stationary bandit settings (Garivier and Moulines 2011b, Cheung et al. 2019a) to design a learning algorithm. To achieve its dynamic regret bound, the algorithm by (Ortner et al. 2019) partitions the entire time horizon into time intervals, and crucially requires access to the variations in both reward and state transition distributions of each individual interval (see Theorem 3 in (Ortner et al. 2019)). In contrast, the SWUCRL2-CW algorithm and the BORL algorithm require significantly less information on the variations. Specifically, the SWUCRL2-CW algorithm does not need any additional knowledge on the variations except for $B_r$ and $B_p$, i.e., the variation budgets over the entire time horizon as defined in eqn. (1), to achieve its dynamic regret bound (see Theorem 4.3). This is similar to algorithms for the drifting bandit settings, which only require access to the total variation budget (Besbes et al. 2014). More importantly, the BORL algorithm (built upon the SWUCRL2-CW algorithm) enjoys the same dynamic regret bound even without knowing either $B_r$ or $B_p$ (see Theorem 5.3).

For online learning and bandit problems where there is only one state, the works by (Auer et al. 2002a, Garivier and Moulines 2011a, Besbes et al. 2014, Keskin and Zeevi 2016, Russac et al. 2019) proposed several “forgetting” strategies for different settings. More recently, the works by (Jadbabaie et al. 2015, Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019b, a, Chen et al. 2019b) designed parameter-free algorithms for non-stationary online learning.

4 Sliding Window UCRL2 with Confidence Widening

In this section, we present the SWUCRL2-CW algorithm, which incorporates sliding window estimates (Garivier and Moulines 2011a) and a novel confidence widening technique into UCRL2 (Jaksch et al. 2010).

4.1 Design Overview

The SWUCRL2-CW algorithm first specifies a sliding window parameter $W$ and a confidence widening parameter $\eta \ge 0$. Parameter $W$ specifies the number of previous time steps to look at. Parameter $\eta$ quantifies the amount of additional optimistic exploration, on top of the conventional optimistic exploration using upper confidence bounds. The latter turns out to be necessary for handling the drifting non-stationarity of the transition kernel.

The algorithm runs in a sequence of episodes that partitions the $T$ time steps. Episode $m$ starts at time $\tau(m)$ (in particular, $\tau(1) = 1$), and ends at the end of time $\tau(m+1) - 1$. Throughout an episode $m$, the DM follows a certain stationary policy $\tilde{\pi}_m$. The DM ceases the $m$-th episode if at least one of the following two criteria is met:

  • The time index $t$ is a multiple of $W$. Consequently, each episode lasts for at most $W$ time steps. This criterion ensures that the DM switches the stationary policy frequently enough, in order to adapt to the non-stationarity of the $r_t$'s and $p_t$'s.

  • There exists some state-action pair $(s, a)$ such that the number of time steps $t$ with $(s_t, a_t) = (s, a)$ within episode $m$ is at least as large as the total number of counts for it within the $W$ time steps prior to $\tau(m)$, i.e., from $(\tau(m) - W) \vee 1$ to $\tau(m) - 1$. This is similar to the doubling criterion in (Jaksch et al. 2010), which ensures that each episode is sufficiently long so that the DM can focus on learning.

The combined effect of these two criteria allows the DM to learn a low dynamic regret policy with historical data from an appropriately sized time window. One important ingredient is the construction of the policy $\tilde{\pi}_m$ for each episode $m$. To allow learning under non-stationarity, the SWUCRL2-CW algorithm computes $\tilde{\pi}_m$ based on the history in the $W$ time steps prior to the current episode $m$, i.e., from round $(\tau(m) - W) \vee 1$ to round $\tau(m) - 1$. The construction of $\tilde{\pi}_m$ involves the Extended Value Iteration (EVI) (Jaksch et al. 2010), which requires the confidence regions for rewards and transition kernels as inputs, in addition to a precision parameter $\epsilon$. The confidence widening parameter $\eta \ge 0$ ensures that the MDP output by the EVI has a bounded diameter most of the time.
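The two episode-termination criteria above can be sketched as a simple predicate. The dict-based counts and the helper name `episode_should_end` are our own illustrative choices:

```python
def episode_should_end(t, W, episode_counts, window_counts, s, a):
    """Return True iff SWUCRL2-CW should end the current episode after
    visiting (s, a) at time t:
    (1) t is a multiple of the window size W, or
    (2) the in-episode count of (s, a) reaches its count over the W steps
        before the episode started (a doubling-style criterion).
    Counts are dicts mapping (s, a) -> int."""
    if t % W == 0:
        return True
    n_episode = episode_counts.get((s, a), 0)
    n_window = max(window_counts.get((s, a), 0), 1)  # N^+ convention
    return n_episode >= n_window
```

Criterion (1) caps the episode length at $W$; criterion (2) ends the episode once fresh data roughly doubles the evidence on some state-action pair.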

4.2 Policy Construction

To describe the SWUCRL2-CW algorithm, we first define, for each state-action pair $(s, a)$ and each time step $t$ in episode $m$,

$$N_t(s, a) := \sum_{q = (\tau(m) - W) \vee 1}^{\tau(m) - 1} \mathbb{1}\{(s_q, a_q) = (s, a)\}, \qquad N_t^+(s, a) := \max\{N_t(s, a), 1\}. \tag{3}$$
Confidence Region for Rewards

For each state-action pair $(s, a)$ and each time step $t$ in episode $m$, we consider the empirical mean estimator

$$\hat{r}_t(s, a) := \frac{1}{N_t^+(s, a)} \sum_{q = (\tau(m) - W) \vee 1}^{\tau(m) - 1} R_q(s, a)\, \mathbb{1}\{(s_q, a_q) = (s, a)\},$$

which serves to estimate the average reward

$$\bar{r}_t(s, a) := \frac{1}{N_t^+(s, a)} \sum_{q = (\tau(m) - W) \vee 1}^{\tau(m) - 1} r_q(s, a)\, \mathbb{1}\{(s_q, a_q) = (s, a)\}.$$

The confidence region $H_{r, t} = \{H_{r, t}(s, a)\}_{s \in \mathcal{S}, a \in \mathcal{A}_s}$ is defined as

$$H_{r, t}(s, a) := \{\dot{r} \in [0, 1] : |\dot{r} - \hat{r}_t(s, a)| \le \mathrm{rad}_{r, t}(s, a)\}, \tag{4}$$

with confidence radius $\mathrm{rad}_{r, t}(s, a)$ of order $\sqrt{\log(SAT/\delta) / N_t^+(s, a)}$.

Confidence Widening for Transition Kernels.

For each state-action pair $(s, a)$ and each time step $t$ in episode $m$, we consider the empirical mean estimator

$$\hat{p}_t(s' \mid s, a) := \frac{1}{N_t^+(s, a)} \sum_{q = (\tau(m) - W) \vee 1}^{\tau(m) - 1} \mathbb{1}\{(s_q, a_q, s_{q+1}) = (s, a, s')\},$$

which serves to estimate the average transition probability

$$\bar{p}_t(s' \mid s, a) := \frac{1}{N_t^+(s, a)} \sum_{q = (\tau(m) - W) \vee 1}^{\tau(m) - 1} p_q(s' \mid s, a)\, \mathbb{1}\{(s_q, a_q) = (s, a)\}.$$

Different from the case of estimating rewards, the confidence region for the transition probability involves a widening parameter $\eta \ge 0$:

$$H_{p, t}(s, a; \eta) := \{\dot{p} \in \Delta_{\mathcal{S}} : \|\dot{p} - \hat{p}_t(\cdot \mid s, a)\|_1 \le \mathrm{rad}_{p, t}(s, a) + \eta\}, \tag{6}$$

with confidence radius $\mathrm{rad}_{p, t}(s, a)$ of order $\sqrt{S \log(SAT/\delta) / N_t^+(s, a)}$.

With $\eta > 0$, the DM can explore transition kernels that deviate from the sample average, and this exploration is crucial for learning MDPs under non-stationarity. In a nutshell, the incorporation of $\eta$ provides an additional source of optimism. We treat $\eta$ as a hyper-parameter at the moment, and provide a suitable choice of $\eta$ when we discuss our main results.
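Membership in the widened confidence region can be checked directly; in the sketch below, the radius is a generic Hoeffding-style choice shrinking like $1/\sqrt{N_t^+}$, so the exact constants may differ from the paper's:

```python
import math

def in_widened_region(p_dot, p_hat, n_plus, S, A, T, delta, eta):
    """Check whether a candidate kernel row p_dot lies in the widened
    confidence region, i.e., ||p_dot - p_hat||_1 <= rad + eta.
    rad is a generic L1 deviation radius (our assumption, not the
    paper's exact constant), capped at 2 (the max possible L1 distance)."""
    rad = min(2.0 * math.sqrt(S * math.log(S * A * T / delta) / n_plus), 2.0)
    l1 = sum(abs(u - v) for u, v in zip(p_dot, p_hat))
    return l1 <= rad + eta
```

With many samples the un-widened ball (`eta = 0`) is tiny, which is exactly why kernels that have drifted far from the sample average are excluded unless `eta` is added.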

Extended Value Iteration (EVI) (Jaksch et al. 2010).

The SWUCRL2-CW algorithm relies on the EVI, which solves MDPs with optimistic exploration to near-optimality. We extract and rephrase a description of EVI in Appendix A.3. EVI takes as input the confidence regions for the rewards and the transition kernels, and outputs an “optimistic MDP model”, which consists of a reward vector and a transition kernel under which the optimal long-term average reward is the largest among all models in the confidence regions:

  • Input: Confidence regions $H_{r, t}$ for the rewards, $H_{p, t}(\eta)$ for the transition kernels, and an error parameter $\epsilon > 0$.

  • Output: The returned policy $\tilde{\pi}$ and the auxiliary output $(\tilde{r}, \tilde{p}, \tilde{\rho}, \tilde{\gamma})$. In the latter, $\tilde{r}$, $\tilde{p}$, and $\tilde{\rho}$ are the selected “optimistic” reward vector, transition kernel, and the corresponding long-term average reward. The output $\tilde{\gamma} \in \mathbb{R}^{S}_{\ge 0}$ is a bias vector (Jaksch et al. 2010). For each $s \in \mathcal{S}$, the quantity $\tilde{\gamma}(s)$ is indicative of the short-term reward when the DM starts at state $s$ and follows the optimal policy. By the design of EVI, for the output $\tilde{\gamma}$, there exists $s \in \mathcal{S}$ such that $\tilde{\gamma}(s) = 0$. Altogether, we express $\text{EVI}(H_{r, t}, H_{p, t}(\eta); \epsilon) = (\tilde{\pi}, \tilde{r}, \tilde{p}, \tilde{\rho}, \tilde{\gamma})$.
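The optimistic model selection inside EVI hinges on an inner maximization over the transition confidence ball. The classic recipe of Jaksch et al. (2010) moves probability mass toward the highest-value state; below is our own sketch, where `radius` plays the role of $\mathrm{rad}_{p,t}(s,a) + \eta$:

```python
def optimistic_transition(p_hat, radius, values):
    """Inner maximization of extended value iteration: among distributions
    within L1 distance `radius` of p_hat, (approximately) maximize the
    expected value sum_x p(x) * values[x]. Recipe: add up to radius/2 mass
    to the best state, removing it from the worst-value states first."""
    n = len(p_hat)
    order = sorted(range(n), key=lambda s: values[s])  # ascending value
    best = order[-1]
    p = list(p_hat)
    p[best] = min(1.0, p[best] + radius / 2.0)
    excess = sum(p) - 1.0
    for s in order:
        if excess <= 1e-12:
            break
        if s == best:
            continue
        take = min(p[s], excess)  # drain low-value states to renormalize
        p[s] -= take
        excess -= take
    return p
```

A widened ball (larger `radius`) lets the optimizer route more mass toward high-value states, which is how extra optimism keeps the optimistic MDP well-connected.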

Combining the three components, a formal description of the SWUCRL2-CW algorithm is shown in Algorithm 1.

Algorithm 1 SWUCRL2-CW algorithm
1: Input: Time horizon $T$, state space $\mathcal{S}$ and action space $\{\mathcal{A}_s\}_{s \in \mathcal{S}}$, window size $W$, confidence widening parameter $\eta$.
2: Initialize $t \leftarrow 1$ and initial state $s_1$.
3: for episode $m = 1, 2, \ldots$ do
4:     Set $\tau(m) \leftarrow t$, and $N_t(s, a)$, $N_t^+(s, a)$ according to eqn. (3), for all $(s, a)$.
5:     Compute the confidence regions $H_{r, \tau(m)}$, $H_{p, \tau(m)}(\eta)$ according to eqns. (4, 6).
6:     Compute an $\epsilon$-optimal optimistic policy $\tilde{\pi}_m$ via EVI.
7:     while $t$ is not a multiple of $W$ and the doubling criterion of Section 4.1 is not met do
8:         Choose action $a_t = \tilde{\pi}_m(s_t)$, observe reward $R_t(s_t, a_t)$ and the next state $s_{t+1}$.
9:         Update $t \leftarrow t + 1$.
10:         if $t > T$ then
11:             The algorithm is terminated.
12:         end if
13:     end while
14: end for

4.3 Performance Analysis: The Blessing of More Optimism

We now analyze the performance of the SWUCRL2-CW algorithm. First, we introduce two events which state that the estimated reward and transition kernels lie in the respective confidence regions.

We prove that $\mathcal{E}_r$ and $\mathcal{E}_p$ hold with high probability. {lemma} We have $\Pr[\mathcal{E}_r] \ge 1 - \delta/2$ and $\Pr[\mathcal{E}_p] \ge 1 - \delta/2$. The proof is in Section B of the appendix. In defining $\mathcal{E}_p$, the widening parameter $\eta$ is set to 0, since we are only concerned with the estimation error on $p$. Next, we bound the dynamic regret of each time step, under certain assumptions on $H_{p, t}(\eta)$. To facilitate our discussion, we define the following variation measure for each $t$ in an episode $m$:


Consider an episode $m$. Condition on the events $\mathcal{E}_r, \mathcal{E}_p$, and suppose that there exists a transition kernel satisfying two properties: (1) it lies in the widened confidence region $H_{p, \tau(m)}(\eta)$, and (2) its diameter is at most $D$. Then, for every $t$ in episode $m$, we have


The complete proof can be found in Section C of the appendix. Unlike Lemma 4.3, the widening parameter $\eta$ plays an important role in the Proposition. As $\eta$ increases, the confidence region $H_{p, t}(s, a; \eta)$ becomes larger for each $(s, a)$, and the assumed diameter $D$ is expected to decrease. Our subsequent analysis shows that $\eta$ can be suitably calibrated so that $D$ is of the same order as $D_{\max}$. Next, we state our first main result, which provides a dynamic regret bound assuming the knowledge of the variation budgets to set $W$ and $\eta$: {theorem} The SWUCRL2-CW algorithm with window size $W$ and confidence widening parameter $\eta > 0$ satisfies the dynamic regret bound

with probability . If we further put


this is

$$\widetilde{O}\big(D_{\max}(B_r + B_p)^{1/4}\, S^{2/3}\, A^{1/2}\, T^{3/4}\big).$$
Proof Sketch. The complete proof is presented in Section D of the appendix. To facilitate the exposition, we denote the total number of episodes by $M$. By abusing the notation, we let $\tau(M+1) = T + 1$. Episode $M$ is interrupted and the algorithm is forced to terminate, as the end of time $T$ is reached. To proceed, we define the set

For each episode , we distinguish two cases:

  • Case 1. Under this situation, we apply Proposition 4.3 to bound the dynamic regret during the episode, using the fact that the episode satisfies the assumptions of the proposition.

  • Case 2. In this case, we trivially upper bound the dynamic regret of each round in the episode by 1, since the optimal long-term average reward of each step is at most 1.

For case 1, we bound the dynamic regret during episode $m$ by summing the error terms in (7, 8) across the rounds in the episode. The term (7) accounts for the error from switching policies. In (8), the first terms account for the estimation errors due to stochastic variations, and the last term accounts for the estimation error due to non-stationarity.

For case 2, we need an upper bound on the total number of rounds that belong to an episode in this set. The analysis is challenging, since the length of each episode may vary, and one can only guarantee that each length is at most $W$. A first attempt could be to upper bound this number by $W$ times the number of such episodes, but the resulting bound appears too loose to provide any meaningful regret bound. Indeed, there could be double counting, as the starting time steps for a pair of episodes in case 2 might not even be $W$ rounds apart!

To avoid the trap of double counting, we consider a set of episodes whose start times are sufficiently far apart, and relate its cardinality to the number of rounds in case 2. The set is constructed sequentially, by examining all episodes in time order. At the start, we initialize the set to be empty. For each episode, we perform the following. If the episode satisfies both criteria:

  1. There exists some and such that

  2. For every

then we add the episode into the set. Afterwards, we move to the next episode index. The process terminates once we arrive at the last episode. The construction ensures that every episode in case 2 either violates one of the criteria because of an episode already in the set, or has itself been added into the set.

By virtue of the confidence widening, we argue that every episode in the set consumes a minimum amount of variation budget. It turns out that we can upper bound the cardinality of the set in terms of the variation budget and the widening parameter, and we also can upper bound the total number of rounds in case 2 accordingly.


Proposition 4.3 states that if the confidence region contains a transition kernel that induces an MDP with bounded diameter, the EVI supplied with this region can return a policy with a controllable dynamic regret bound. However, as we show in Section 6, one in general cannot expect this to happen. Nevertheless, we bypass this with our novel confidence widening technique and a budget-aware analysis. We consider the first time step of each episode: if the widened confidence region contains a transition kernel of bounded diameter, then Proposition 4.3 can be leveraged; otherwise, the widened confidence region enforces that a considerable amount of variation budget is consumed. {remark} When $S = 1$, our problem becomes the non-stationary bandit problem studied by (Besbes et al. 2014), in which case $B_p = 0$ and the diameter is trivially zero. By choosing the window size appropriately, our algorithm has dynamic regret $\widetilde{O}(B_r^{1/3} T^{2/3})$, matching (up to logarithmic factors) the minimax optimal dynamic regret bound by (Besbes et al. 2014). {remark} Similar to (Cheung et al. 2019b, a), if $B_r$ and $B_p$ are not known, we can set $W$ and $\eta$ obliviously to obtain a dynamic regret bound that scales linearly in the variation budgets.

5 Bandit-over-Reinforcement Learning: Towards Parameter-Free

As pointed out by Remark 4.3, in the case of unknown $B_r$ and $B_p$, the dynamic regret of the SWUCRL2-CW algorithm scales linearly in the variation budgets. However, by Theorem 4.3, we are assured that a fixed pair of parameters $(W, \eta)$ can ensure low dynamic regret. For the bandit setting, (Cheung et al. 2019a, b) propose the bandit-over-bandit framework that uses a separate copy of the EXP3 algorithm to tune the window size. Inspired by it, we develop a novel Bandit-over-Reinforcement Learning (BORL) algorithm with a parameter-free dynamic regret bound.

5.1 Design Overview

Following a similar line of reasoning as (Cheung et al. 2019a), we make use of the SWUCRL2-CW algorithm as a sub-routine, and “hedge” (Bubeck and Cesa-Bianchi 2012) against the (possibly adversarial) changes of ’s and ’s to identify a reasonable fixed window size and confidence widening parameter.

As illustrated in Fig. 1, the BORL algorithm divides the whole time horizon into blocks of equal length $H$ rounds (the length of the last block can be less than $H$), and specifies a set $J$ from which each pair of (window size, confidence widening) parameters is drawn. For each block $i$, the BORL algorithm first calls a master algorithm to select a pair of (window size, confidence widening) parameters $(W_i, \eta_i) \in J$, and restarts the SWUCRL2-CW algorithm with the selected parameters as a sub-routine to choose actions for this block. Afterwards, the total reward of block $i$ is fed back to the master, and the “posterior” of these parameters is updated accordingly.

One immediate challenge not present in the bandit setting (Cheung et al. 2019b) is that the starting state of each block is determined by the DM's previous moves. Hence, the master algorithm does not face a simple oblivious environment as in (Cheung et al. 2019b), and we cannot use the EXP3 (Auer et al. 2002a) algorithm as the master. Fortunately, the state is observed before the start of each block. Thus, we use the EXP3.P algorithm for multi-armed bandits against an adaptive adversary (Auer et al. 2002a, Bubeck and Cesa-Bianchi 2012) as the master algorithm. We follow the exposition in Section 3.2 of (Bubeck and Cesa-Bianchi 2012) in adapting the EXP3.P algorithm.
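A minimal EXP3.P sketch in the spirit of Bubeck and Cesa-Bianchi (2012, Sec. 3.2) is given below. The parameter values and the `reward_fn` oracle (standing in for "run SWUCRL2-CW with the chosen parameter pair for one block and return its rescaled total reward") are our own illustrative assumptions:

```python
import math, random

def make_exp3p(K, gamma=0.1, beta=0.05, eta=0.1, seed=0):
    """EXP3.P sketch for K arms against an adaptive adversary: exponential
    weights mixed with uniform exploration, plus a bias term beta that
    keeps the gain estimates optimistic (high-probability guarantee)."""
    rng = random.Random(seed)
    weights = [1.0] * K

    def step(reward_fn):
        total = sum(weights)
        # mix exponential weights with uniform exploration of rate gamma
        probs = [(1 - gamma) * w / total + gamma / K for w in weights]
        arm = rng.choices(range(K), weights=probs)[0]
        r = reward_fn(arm)  # observed reward of the chosen arm, in [0, 1]
        for a in range(K):
            # importance-weighted gain estimate with the EXP3.P bias beta
            g = ((r if a == arm else 0.0) + beta) / probs[a]
            weights[a] *= math.exp(eta * gamma / K * g)
        return arm, probs

    return step
```

Because the gain estimates depend only on realized rewards, the update remains valid even when an adaptive adversary (here, the block's starting state) reacts to past plays.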

Figure 1: Structure of the BORL algorithm

5.2 Design Details

We are now ready to state the details of the BORL algorithm. For some fixed choice of block length (to be determined later), we first define a couple of additional notations:


Here, the two component sets consist of all possible choices of the window size and the confidence widening parameter, respectively, and $J$ is their Cartesian product. We also consider, for each block $i$, the total rewards of running the SWUCRL2-CW algorithm with a given window size and confidence widening parameter, starting from the state observed at the beginning of the block.

The EXP3.P algorithm treats each element of $J$ as an arm. It begins by initializing the arm weights according to eqn. (10). At the beginning of each block $i$, the BORL algorithm first sees the state and computes the arm-selection probabilities according to eqn. (11). Then it samples a pair of (window size, confidence widening) parameters according to these probabilities. Afterwards, the BORL algorithm starts from the observed state and selects actions by running the SWUCRL2-CW algorithm with the selected window size and confidence widening parameter for each round in block $i$. At the end of the block, the BORL algorithm observes the total rewards. As a last step, it rescales the total rewards so that they lie within $[0, 1]$, and updates the weights according to eqn. (12).


The formal description of the BORL algorithm (with defined in the next subsection) is shown in Algorithm 2.

Algorithm 2 BORL algorithm
1: Input: Time horizon $T$, state space $\mathcal{S}$ and action space $\{\mathcal{A}_s\}_{s \in \mathcal{S}}$, initial state $s_1$.
2: Initialize the parameters according to eqn. (9), and the weights according to eqn. (10).
3: for each block $i = 1, 2, \ldots$ do
4:     Define the distribution according to eqn. (11), and sample a pair of (window size, confidence widening) parameters accordingly.
5:     for each round in block $i$ do
6:         Run the SWUCRL2-CW algorithm with the selected window size and confidence widening parameter, and observe the total rewards.
7:     end for
8:     Update the weights according to eqn. (12).
9: end for

5.3 Performance Analysis

The dynamic regret guarantee of the BORL algorithm is as follows. {theorem} With probability $1 - O(\delta)$, the dynamic regret bound of the BORL algorithm is $\widetilde{O}\big(D_{\max}(B_r + B_p)^{1/4} S^{2/3} A^{1/2} T^{3/4}\big)$. The proof is provided in Section E of the appendix.

6 The Perils of Drift in Learning Markov Decision Processes

In online stochastic environments, one usually estimates a latent quantity by taking the time average of observed samples, even when the sample distribution varies across time. This has been proven to work well in non-stationary bandit settings (Garivier and Moulines 2011a, Cheung et al. 2019a, b). To extend this to RL, it is natural to consider the sample average transition kernel, which uses the data in the previous $W$ rounds to estimate the time average transition kernel $\bar{p}_t$ to within an additive error (see Section 4.2.3 and Lemma 4.3). In the case of stationary MDPs, where $p_1 = \cdots = p_T$, one has $\bar{p}_t = p_t$. Thus, the un-widened confidence region $H_{p, t}(s, a; 0)$ contains $p_t(\cdot \mid s, a)$ with high probability. Consequently, the UCRL2 algorithm by (Jaksch et al. 2010), which optimistically explores this confidence region, has a regret that scales linearly with the diameter of the underlying MDP.

The approach of optimistic exploration is further extended to RL in piecewise-stationary MDPs by (Jaksch et al. 2010, Gajane et al. 2018). The latter establishes a dynamic regret bound when there are at most a given number of changes. Their analyses involve partitioning the $T$-round horizon into equal-length intervals, where the interval length is a constant dependent on the number of changes. All but a bounded number of the intervals enjoy stationary environments, and optimistic exploration in these intervals yields a dynamic regret bound that scales linearly with the maximum diameter. Bounding the dynamic regret of the remaining intervals by their lengths and tuning the interval length yield the desired bound.

In contrast to the stationary and piecewise-stationary settings, optimistic exploration of the un-widened confidence region might lead to unfavorable dynamic regret bounds in drifting MDPs. In the drifting environment, where the kernels $p_1, \ldots, p_T$ are generally distinct, we show that it is impossible to bound the diameter of the MDP induced by the time average kernel $\bar{p}_t$ in terms of the maximum of the diameters of the individual kernels. More generally, we demonstrate the previous claim not only for $\bar{p}_t$, but also for every kernel in the un-widened confidence region.