# Reinforcement Learning under Drift

Wang Chi Cheung, Department of Industrial Systems Engineering and Management, National University of Singapore, isecwc@nus.edu.sg

David Simchi-Levi, Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu

Ruihao Zhu, Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, MA 02139, rzhu@mit.edu

Abstract. We propose algorithms with state-of-the-art dynamic regret bounds for un-discounted reinforcement learning under drifting non-stationarity, where both the reward functions and state transition distributions are allowed to evolve over time. Our main contributions are: 1) A tuned Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence-Widening (SWUCRL2-CW) algorithm, which attains low dynamic regret bounds against the optimal non-stationary policy in various cases. 2) The Bandit-over-Reinforcement Learning (BORL) framework that further permits us to enjoy these dynamic regret bounds in a parameter-free manner.

Keywords: non-stationary reinforcement learning, Markov decision process, parameter-free algorithm

## 1 Introduction

Consider a discrete-time Markov decision process (MDP) where a decision-maker (DM) interacts with a system iteratively: in each round, the DM first observes the current state of the system, and then picks an available action. Afterwards, it receives an immediate random reward, and the system transits to the next state according to some state transition distribution. The reward distribution and the state transition distribution depend on the current state and the chosen action, but are independent of all the previous states and actions. The goal of the DM is to maximize its cumulative rewards under the following challenges:

• Uncertainty: the reward and the state transition distributions are initially unknown to the DM.

• Non-stationarity: the environment is non-stationary, and both the reward distributions and the state transition distributions can evolve over time.

• Partial feedback: the DM can only observe the reward and state transition resulting from the current state and the chosen action in each round.

In fact, many applications, such as inventory control (Bertsekas 2017) and transportation (Zhang and Wang 2018, Qin et al. 2019), can be modeled by this general framework.

Under stationarity, this problem can be solved by the classical Upper-Confidence bound for Reinforcement Learning (UCRL2) algorithm (Jaksch et al. 2010). Unfortunately, strategies for the stationary setting can deteriorate in non-stationary environments as historical data “expires”. To address this shortcoming, (Jaksch et al. 2010, Gajane et al. 2018) further consider a switching MDP setting where the MDP is piecewise-stationary, and propose solutions for it. Under the special case of multi-armed bandit (MAB), where the MDP has only one state, there is a recent stream of research initiated by (Besbes et al. 2014) that studies the so-called drifting environment (Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019, Chen et al. 2019), in which the reward of each action can change arbitrarily over time, but the total change (quantified by a suitable metric) is upper bounded by some variation budget (Besbes et al. 2014). The aim is to minimize the dynamic regret, i.e., the optimality gap compared to the cumulative rewards of the sequence of optimal actions.

In this paper, we generalize the concept of “drift” from MAB to RL, i.e., the reward and state transition distributions can shift gradually as long as their total changes over $T$ rounds are bounded by $B_r$ and $B_p$, respectively. We then design and analyze novel algorithms for RL in a drifting environment. Let $S$ and $A$ be the numbers of states and actions of the MDP, respectively. Our main contributions are as follows.

• When the variation budgets are known, we develop a Sliding Window UCRL2 with Confidence Widening (SWUCRL2-CW) algorithm with dynamic regret bound $\tilde{O}(D_{\max}(B_r+B_p+1)^{1/3}S^{2/3}A^{1/3}T^{2/3})$ under a time-invariant variation budget assumption that is similar to, but less restrictive than, that of (Yu and Mannor 2009).

• We identify a unique challenge in RL under drift: existing works for switching MDP settings (Jaksch et al. 2010, Gajane et al. 2018) estimate unknown parameters by averaging historical data in a “forgetting” fashion, and crucially exploit the piecewise stationary environment to achieve low regret. But in general non-stationary settings, the diameter (a complexity measure to be defined in Section 3) of the MDP estimated in this manner can grow wildly, and may result in unfavorable dynamic regret bounds for drifting environments due to a lack of piecewise stationarity. This is further discussed in Section 4.4 and Section E. We overcome this with our novel confidence widening technique when the variations are uniformly bounded over time. We also show that one can bypass this difficulty for many realistic cases stemming from inventory control or queuing systems under a mild assumption.

• When the variation budgets are unknown, we propose the Bandit-over-Reinforcement Learning (BORL) framework that tunes the SWUCRL2-CW algorithm adaptively, and hence enjoys a parameter-free dynamic regret bound.

## 2 Related Works

Learning un-discounted MDPs under stationarity has been studied in (Bartlett and Tewari 2009, Jaksch et al. 2010, Agrawal and Jia 2017, Fruit et al. 2018a, b) among others. For non-stationary MDPs, the stream of works (Even-Dar et al. 2005, Nilim and Ghaoui 2005, Xu and Mannor 2006, Dick et al. 2014) considers settings with either changing reward functions or changing transition kernels, but not both. In contrast, (Yu and Mannor 2009) allow arbitrary changes in rewards, but (globally) bounded changes in the transition kernels, and design algorithms under additional Markov chain mixing assumptions. (Jaksch et al. 2010, Gajane et al. 2018) propose solutions for the switching setting. (Abbasi-Yadkori et al. 2013) consider learning MDPs in an adversarial environment with full information feedback.

For online learning and bandit problems where there is only one state, the works by (Auer et al. 2002, Garivier and Moulines 2011, Besbes et al. 2014, Keskin and Zeevi 2016) propose several “forgetting” strategies for different settings. More recently, the works by (Jadbabaie et al. 2015, Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019, Chen et al. 2019) design parameter-free algorithms for non-stationary online learning.

## 3 Problem Formulation

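An instance of non-stationary online MDP is specified by the tuple $(\mathcal{S}, \mathcal{A}, T, r, p)$. The set $\mathcal{S}$ is a finite set of states. The collection $\mathcal{A} = \{\mathcal{A}_s\}_{s\in\mathcal{S}}$ contains a finite action set $\mathcal{A}_s$ for each state $s\in\mathcal{S}$. The quantity $T$ is the total number of time steps. The quantity $r = \{r_t\}_{t=1}^{T}$ is a sequence of mean rewards. For each $t$, we have $r_t = \{r_t(s,a)\}_{s\in\mathcal{S}, a\in\mathcal{A}_s}$, where $r_t(s,a)\in[0,1]$ for each $(s,a)$. The quantity $p = \{p_t\}_{t=1}^{T}$ is a sequence of transition kernels. For each $t$, we have $p_t = \{p_t(\cdot|s,a)\}_{s\in\mathcal{S}, a\in\mathcal{A}_s}$, where $p_t(\cdot|s,a)$ is a probability distribution over $\mathcal{S}$ for each $(s,a)$. The quantities $(r_t, p_t)$ vary across different $t$'s in general. We elaborate on their temporal variations in Section 4, where we provide the performance guarantee of our algorithm according to the degree of non-stationarity.

Dynamics. We consider a DM who faces an online non-stationary MDP instance $(\mathcal{S}, \mathcal{A}, T, r, p)$. The DM knows $\mathcal{S}, \mathcal{A}, T$, but not $r, p$. It starts at an arbitrary state $s_1\in\mathcal{S}$. At time $t$, three events happen. First, the DM observes its current state $s_t$. Second, it takes an action $a_t\in\mathcal{A}_{s_t}$. Third, given $(s_t, a_t)$, it stochastically transits to another state $s_{t+1}$, which is distributed as $p_t(\cdot|s_t,a_t)$, and receives a stochastic reward $R_t(s_t,a_t)$, which is 1-sub-Gaussian with mean $r_t(s_t,a_t)$. In the second event, the choice of $a_t$ is based on a non-anticipatory policy $\Pi$. That is, the choice only depends on the current state $s_t$ and the previous observations $\{(s_q, a_q, R_q(s_q,a_q))\}_{q=1}^{t-1}$.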

Objective. The DM aims to maximize the cumulative expected reward $\mathbb{E}\left[\sum_{t=1}^{T} r_t(s_t,a_t)\right]$, despite the model uncertainty on $r$ and $p$ and the non-stationarity of the learning environment. To measure the convergence of an online algorithm to optimality, we consider an equivalent objective of dynamic regret minimization. For each time $t$, let $\rho^*_t$ be the optimal long term reward for the online MDP problem with stationary transition kernel $p_t$ and stationary mean reward function $r_t$. To compute $\rho^*_t$, we refer to the linear programming formulation (18) in Section A. The DM aims to design a policy $\Pi$ to minimize the dynamic regret

$$\text{Dyn-Reg}_T(\Pi) = \sum_{t=1}^{T}\left\{\rho^*_t - \mathbb{E}[r_t(s_t,a_t)]\right\}.$$

###### Remark

We note that this regret notion recovers the dynamic regret definition in the bandit setting when specialized to bandits. Different from the bandit setting, the expected total reward received could be much higher than the benchmark, as the latter does not take the starting state into account.

We also review the relevant concepts of communicating MDPs and their diameters.

###### Definition (Hitting Time, Communicating MDPs and Diameters)

Consider a set of states $\mathcal{S}$, a collection $\mathcal{A} = \{\mathcal{A}_s\}_{s\in\mathcal{S}}$ of action sets, and a transition kernel $\bar{p} = \{\bar{p}(\cdot|s,a)\}_{s\in\mathcal{S}, a\in\mathcal{A}_s}$. For any $s, s'\in\mathcal{S}$ and any stationary policy $\pi$, the hitting time from $s$ to $s'$ under $\pi$ is the random variable

$$\Lambda(s'|\pi,s) := \min\left\{t : s_{t+1} = s',\ s_1 = s,\ s_{\tau+1}\sim\bar{p}(\cdot|s_\tau,\pi(s_\tau))\ \forall\tau\right\},$$

which can be infinite. We say that $(\mathcal{S},\mathcal{A},\bar{p})$ constitutes a communicating MDP if and only if

$$D := \max_{s,s'\in\mathcal{S}}\ \min_{\text{stationary }\pi}\ \mathbb{E}\left[\Lambda(s'|\pi,s)\right]$$

is finite. The quantity $D$ is the diameter associated with $(\mathcal{S},\mathcal{A},\bar{p})$. To enable learning in a non-stationary environment, we make the following reachability assumption.

###### Assumption

For each time $t\in\{1,\ldots,T\}$, the tuple $(\mathcal{S},\mathcal{A},p_t)$ constitutes a communicating MDP with diameter at most $D_t$. We denote $D_{\max} := \max_{t\in\{1,\ldots,T\}} D_t$.
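To make the diameter definition concrete, the following is a minimal sketch that computes it for a small, fully known kernel by value iteration on expected hitting times. The function name `diameter`, the nested-list layout `p[s][a][s2]`, and the stopping parameters are illustrative assumptions, not part of the paper. Following the displayed definition literally, the maximum also ranges over pairs with $s = s'$ (return times).

```python
def diameter(p, tol=1e-10, max_iter=100_000):
    """Diameter of a known MDP; p[s][a][s2] is the probability of s -> s2 under action a."""
    n = len(p)
    worst = 0.0
    for target in range(n):
        # h[s]: current estimate of the minimal expected number of steps
        # needed to reach `target` from s; pinned to 0 at the target itself.
        h = [0.0] * n
        for _ in range(max_iter):
            # One step now, plus the expected remaining time from the next state,
            # minimized over the actions available in s.
            first = [
                min(1.0 + sum(pa[s2] * h[s2] for s2 in range(n)) for pa in p[s])
                for s in range(n)
            ]
            new_h = first[:]
            new_h[target] = 0.0
            gap = max(abs(x - y) for x, y in zip(new_h, h))
            h = new_h
            if gap < tol:
                break
        else:
            # No convergence within max_iter: treat the target as unreachable.
            return float("inf")
        # first[target] is the expected return time to `target`.
        worst = max(worst, max(first))
    return worst
```

A return value of `float("inf")` only signals that the iteration did not settle within `max_iter` sweeps, which is how this sketch flags a non-communicating kernel.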

## 4 Sliding Window UCRL2 with Confidence Widening

In this section, we present the SWUCRL2-CW algorithm, which incorporates sliding window estimates (Garivier and Moulines 2011) and a novel confidence widening technique into the UCRL2 algorithm (Jaksch et al. 2010).

### 4.1 Design Overview

The SWUCRL2-CW algorithm first specifies a window length $W$ and runs in a sequence of episodes that partitions the $T$ time steps. Episode $m$ starts at round $\tau(m)$ (in particular $\tau(1) = 1$), and ends at the end of round $\tau(m+1)-1$. Throughout episode $m$, the DM follows a certain stationary policy $\tilde{\pi}_m$. It ceases the episode if at least one of the following two criteria is met:

• The round index $t$ is a multiple of $W$. This ensures that each episode is at most $W$ time steps long, and prevents choosing actions based on out-of-date information.

• There exists some (state, action) pair $(s,a)$ such that the number of visits to it within episode $m$ is at least as large as the total number of visits to it within the $W$ rounds prior to $\tau(m)$, i.e., rounds $(\tau(m)-W)\vee 1, \ldots, \tau(m)-1$. This is similar to the doubling criterion in (Jaksch et al. 2010), which ensures that each episode is sufficiently long so that the DM can focus on learning.
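The two stopping criteria above can be sketched as a small predicate. The function name, the dictionary-based visit counters, and the use of $\max\{1,\cdot\}$ on the prior count (mirroring the $N^+_t$ convention of equation (1)) are assumptions made for illustration.

```python
def episode_should_end(t, W, visits_in_episode, visits_before_episode):
    """Return True if the current episode should stop at round t.

    visits_in_episode:     {(s, a): visits within the current episode}
    visits_before_episode: {(s, a): visits within the W rounds before the episode}
    """
    # Criterion 1: the round index is a multiple of the window length.
    if t % W == 0:
        return True
    # Criterion 2: some pair has been visited at least as often within the
    # episode as within the preceding window (prior counts floored at 1).
    for pair, count in visits_in_episode.items():
        prior = max(1, visits_before_episode.get(pair, 0))
        if count >= prior:
            return True
    return False
```

Flooring the prior count at 1 avoids ending an episode before any visit has been recorded; without it, a pair with no prior visits would trigger the second criterion immediately.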

The combined effect of these two criteria allows the DM to learn a near-optimal policy with historical data from an appropriately sized time window. One important ingredient is the construction of the policy $\tilde{\pi}_m$ for each episode $m$. To allow learning under non-stationarity, the SWUCRL2-CW algorithm computes the policy $\tilde{\pi}_m$ based on the history in the $W$ rounds before the current episode $m$, i.e., between round $(\tau(m)-W)\vee 1$ and round $\tau(m)-1$. The construction of $\tilde{\pi}_m$ involves the Extended Value Iteration (EVI) (Jaksch et al. 2010), which requires the confidence/uncertainty regions for rewards and transition kernels as inputs, in addition to an error parameter $\epsilon$. We note that the parameter $\eta$ is a confidence widening parameter, which helps ensure that the output MDP of the EVI has a bounded diameter.

### 4.2 Policy Construction

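For ease of discussion, we first define, for each state-action pair $(s,a)$ and each round $t$ in episode $m$,

$$N_t(s,a) := \sum_{q=(\tau(m)-W)\vee 1}^{t-1}\mathbf{1}\big((s_q,a_q)=(s,a)\big), \qquad N^+_t(s,a) := \max\{1, N_t(s,a)\}. \tag{1}$$

1) Confidence Region for Rewards. For each state-action pair $(s,a)$ and each round $t$ in episode $m$, we consider the empirical mean estimator

$$\hat{r}_t(s,a) := \frac{1}{N^+_t(s,a)}\left(\sum_{q=(\tau(m)-W)\vee 1}^{t-1} R_q(s_q,a_q)\,\mathbf{1}(s_q=s, a_q=a)\right),$$

which has mean

$$\bar{r}_t(s,a) := \frac{1}{N^+_t(s,a)}\left(\sum_{q=(\tau(m)-W)\vee 1}^{t-1} r_q(s,a)\,\mathbf{1}(s_q=s, a_q=a)\right).$$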

The confidence region is with

where .

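2) Confidence Region and Confidence Widening for Transition Kernels. For each state-action pair $(s,a)$, each next state $s'\in\mathcal{S}$, and each time $t$ in episode $m$, we also consider the empirical mean estimator

$$\hat{p}_t(s'|s,a) := \frac{1}{N^+_t(s,a)}\left(\sum_{q=(\tau(m)-W)\vee 1}^{t-1}\mathbf{1}(s_q=s, a_q=a, s_{q+1}=s')\right), \tag{3}$$

which has mean

$$\bar{p}_t(s'|s,a) := \frac{1}{N^+_t(s,a)}\sum_{q=(\tau(m)-W)\vee 1}^{t-1} p_q(s'|s,a)\,\mathbf{1}(s_q=s, a_q=a). \tag{4}$$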

The confidence region is with

where .
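As an illustration of the sliding-window counts in (1) and the empirical estimators for rewards and transitions in (3), here is a minimal sketch. The function name, the `history` tuple layout, and the assumption that every state shares the same number of actions are illustrative choices; the confidence radii are deliberately not computed here.

```python
def sliding_window_estimates(history, n_states, n_actions, tau_m, W, t):
    """Empirical estimates at round t of the episode that started at round tau_m.

    history[q - 1] = (s_q, a_q, R_q, s_{q+1}) for rounds q = 1, 2, ... (1-indexed).
    Uses rounds q = max(tau_m - W, 1), ..., t - 1, as in equation (1).
    """
    start = max(tau_m - W, 1)
    N = [[0] * n_actions for _ in range(n_states)]
    r_sum = [[0.0] * n_actions for _ in range(n_states)]
    p_cnt = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for q in range(start, t):
        s, a, reward, s_next = history[q - 1]
        N[s][a] += 1
        r_sum[s][a] += reward
        p_cnt[s][a][s_next] += 1
    # Divide by N^+ = max(1, N), so unvisited pairs get all-zero estimates.
    r_hat = [[r_sum[s][a] / max(1, N[s][a]) for a in range(n_actions)]
             for s in range(n_states)]
    p_hat = [[[c / max(1, N[s][a]) for c in p_cnt[s][a]] for a in range(n_actions)]
             for s in range(n_states)]
    return N, r_hat, p_hat
```

Note that for a pair with no visits in the window, the estimated transition vector is all zeros rather than a probability distribution, which is a direct consequence of dividing by $N^+_t(s,a)$.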

3) Extended Value Iteration (EVI) (Jaksch et al. 2010). The SWUCRL2-CW algorithm relies on the EVI, which solves MDPs with optimistic exploration to near-optimality. We extract (and rephrase) a description of EVI in Appendix A.1. EVI inputs the confidence regions for the rewards and the transition kernels. It then outputs an “optimistic MDP model”, which consists of reward vector and transition kernel under which the optimal average gain is the largest among all reward vectors and transition kernels in the supplied confidence regions. Specifically, it works as follows:

• Input: confidence regions $H_r$ for $r$, $H_p$ for $p$, and an arbitrarily small error parameter $\epsilon$.

• Output: The returned policy $\tilde{\pi}$ and the auxiliary output $(\tilde{r}, \tilde{p}, \tilde{\rho}, \tilde{\gamma})$. Here, $\tilde{r}$, $\tilde{p}$, and $\tilde{\rho}$ are the selected “optimistic” reward vector, transition kernel, and the corresponding long term average reward, while $\tilde{\gamma}$ is the bias vector (Jaksch et al. 2010). Collectively, we express . Note that but for some .

The outputs of the EVI further satisfy the following two properties.

###### Property 1

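The dual variables $(\tilde{\rho}, \tilde{\gamma})$ are optimistic, i.e.,

$$\tilde{\rho} + \tilde{\gamma}(s) \ \ge\ \max_{\dot{r}(s,a)\in H_r(s,a)}\{\dot{r}(s,a)\} + \sum_{s'\in\mathcal{S}}\tilde{\gamma}(s')\max_{\dot{p}\in H_p(s,a)}\{\dot{p}(s'|s,a)\}.$$

###### Property 2

For each state $s\in\mathcal{S}$, we have

$$\tilde{r}(s,\tilde{\pi}(s)) \ \ge\ \tilde{\rho} + \tilde{\gamma}(s) - \sum_{s'\in\mathcal{S}}\tilde{p}(s'|s,\tilde{\pi}(s))\,\tilde{\gamma}(s') - \epsilon.$$

Consequently, $\tilde{\pi}$ is an optimal (up to an additive factor of $\epsilon$) deterministic policy for the MDP $(\mathcal{S},\mathcal{A},\tilde{r},\tilde{p})$.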

With these, the formal description of the SWUCRL2-CW algorithm is shown in Algorithm 1.
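For intuition, the sketch below follows the standard extended value iteration of (Jaksch et al. 2010): rewards are pushed to the top of their intervals, and each transition vector is chosen from an $L_1$ ball around the empirical estimate so as to favor high-value states. Several things here are assumptions for illustration: the function names, the nested-list inputs, the $L_1$-ball form of the transition region, and passing the radii in as plain numbers (confidence widening would then correspond to enlarging `rad_p` by $\eta$). It is not a transcription of the paper's Algorithm 3.

```python
def optimistic_transition(p_hat, radius, u):
    """Maximize sum_s p[s] * u[s] over the simplex with ||p - p_hat||_1 <= radius."""
    order = sorted(range(len(u)), key=lambda s: -u[s])  # best state first
    p = list(p_hat)
    best = order[0]
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    # Remove the surplus mass from the least valuable states first.
    for s in reversed(order):
        excess = sum(p) - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return p


def extended_value_iteration(r_hat, rad_r, p_hat, rad_p, eps, max_iter=10_000):
    """Return (policy, optimistic gain, bias vector shifted so that its minimum is 0)."""
    n = len(r_hat)
    u = [0.0] * n
    for _ in range(max_iter):
        q = []
        for s in range(n):
            row = []
            for a in range(len(r_hat[s])):
                p = optimistic_transition(p_hat[s][a], rad_p[s][a], u)
                value = sum(pi * ui for pi, ui in zip(p, u))
                row.append(r_hat[s][a] + rad_r[s][a] + value)
            q.append(row)
        u_new = [max(row) for row in q]
        diff = [x - y for x, y in zip(u_new, u)]
        # Stop once the span of successive differences falls below eps.
        if max(diff) - min(diff) < eps:
            policy = [row.index(max(row)) for row in q]
            rho = 0.5 * (max(diff) + min(diff))
            low = min(u_new)
            return policy, rho, [x - low for x in u_new]
        u = u_new
    raise RuntimeError("extended value iteration did not converge")
```

With all radii set to zero, the routine reduces to ordinary undiscounted value iteration on the empirical model, which gives a convenient sanity check.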

### 4.3 Performance Analysis

We are now ready to analyze the performance of the SWUCRL2-CW algorithm. As we are under a changing environment, the dynamic regret bound of SWUCRL2-CW algorithm shall adapt to the underlying shifts (Besbes et al. 2014). We thus define the following variation measures for the drifts in rewards and transition kernels across time:

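$$B_r = \sum_{t=1}^{T-1} B_{r,t}, \qquad B_{r,t} \ \ge\ \max_{s\in\mathcal{S}, a\in\mathcal{A}_s}\left\{|r_{t+1}(s,a) - r_t(s,a)|\right\}, \tag{6}$$

$$B_p = \sum_{t=1}^{T-1} B_{p,t}, \qquad B_{p,t} \ \ge\ \max_{q\in[T-1], s\in\mathcal{S}, a\in\mathcal{A}_s}\left\{\|p_{q+1}(\cdot|s,a) - p_q(\cdot|s,a)\|_1\right\}. \tag{7}$$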

One can interpret the quantities $B_{r,t}, B_{p,t}$ as upper bounds on the variations in rewards and transition kernels between rounds $t$ and $t+1$, and the quantities $B_r, B_p$ as the total allowable variations for rewards and transition kernels in $T$ rounds. (Clearly, we cannot hope to achieve a dynamic regret sublinear in $T$ if $B_r$ or $B_p$ is $\Omega(T)$, so we focus on the case when $B_r, B_p = o(T)$.) Similar to (but less restrictive than) (Yu and Mannor 2009), we further make the following assumption, which requires the transition kernel to change slowly at a steady rate across time.

###### Assumption

The transition kernel’s variation budgets are uniform over time, i.e.,

$$B_{p,1} = \ldots = B_{p,T-1}.$$

Together with the confidence widening technique, this guarantees that the MDP output by the EVI has a bounded diameter. For ease of exposition, we also define the piecewise variations for each round $t$ in episode $m$:

$$\text{var}_{r,t} := \sum_{q=\tau(m)-W}^{t-1} B_{r,q}, \qquad \text{var}_{p,t} := B_{p,1}\left(t - \tau(m) + \frac{W}{2}\right).$$
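To see what the variation measures in (6) and (7) amount to numerically, the sketch below computes the tightest admissible values for a fully known sequence of mean rewards and kernels. The function name and the nested-list layout (`r[t][s][a]`, `p[t][s][a][s2]`, zero-indexed in `t`) are illustrative assumptions. Consistent with (7) and the uniform-budget assumption, the per-round kernel budget is taken as the largest $L_1$ change over all rounds.

```python
def variation_budgets(r, p):
    """Return (B_r, B_p, per-round reward variations, uniform per-round kernel budget)."""
    T = len(r)
    reward_steps, kernel_steps = [], []
    for t in range(T - 1):
        pairs = [(s, a) for s in range(len(r[t])) for a in range(len(r[t][s]))]
        # Largest change in mean reward over all state-action pairs.
        reward_steps.append(max(abs(r[t + 1][s][a] - r[t][s][a]) for s, a in pairs))
        # Largest L1 change of a transition distribution over all pairs.
        kernel_steps.append(max(
            sum(abs(x - y) for x, y in zip(p[t + 1][s][a], p[t][s][a]))
            for s, a in pairs))
    B_r = sum(reward_steps)
    B_p_step = max(kernel_steps, default=0.0)  # same budget for every round
    B_p = (T - 1) * B_p_step
    return B_r, B_p, reward_steps, B_p_step
```

Because the kernel budget is uniform, a single abrupt change in the transition kernel inflates $B_p$ by a factor of $T-1$, which illustrates why the assumption is best suited to slowly drifting kernels.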

To proceed, we introduce two events, which state that the means of the reward and transition kernel estimators lie in the respective confidence regions.

$$E_r := \left\{\bar{r}_t(s,a)\in H_{r,t}(s,a)\ \ \forall s,a,t\right\}, \qquad E_p := \left\{\bar{p}_t(\cdot|s,a)\in H_{p,t}(s,a;0)\ \ \forall s,a,t\right\}.$$

We prove that $E_r, E_p$ hold with high probability.

###### Lemma

We have , .

The proof is in Section B of the appendix. We then bound the dynamic regret of each round.

###### Proposition

Conditioning on events $E_r, E_p$ and assuming that , for every episode $m$ and every time $t$ in the episode (i.e. ) the following inequality holds with certainty:

###### Proof.

Proof Sketch. The complete proof can be found in Section C of the appendix. Consider time $t$, which belongs to episode $m$. At a high level, the proof goes through three steps:

1. The estimated mean reward (used in computing ) and the true reward differ by at most

2. With the widened confidence regions we are guaranteed that there exists an MDP, i.e., with diameter at most in By Lemma A and the dual formulation of the optimal reward (19), the optimistic long term reward induced by is at least

3. Finally, we relate the optimistic transition kernel and the underlying kernel to account for the extra loss due to a widened confidence region.

Combining all three steps, we can conclude the statement. ∎

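Suppose we denote , and use the notation $\tilde{O}(\cdot)$ to hide logarithmic dependence on and the confidence parameter for defining the confidence regions . Our first main result is a dynamic regret upper bound for the SWUCRL2-CW algorithm.

###### Theorem

Assuming , the SWUCRL2-CW algorithm with window size $W$ and confidence widening parameter $\eta$ satisfies the dynamic regret bound

$$\text{Dyn-Reg}_T(\text{SWUCRL2-CW}) = \tilde{O}\left(B_r W + D_{\max}\left[B_p W + \frac{S\sqrt{A}\,T}{\sqrt{W}} + T\eta + \frac{SAT}{W} + \sqrt{T}\right]\right) \tag{8}$$

with probability . If we further put

$$W = W^* = \frac{S^{2/3}A^{1/3}T^{2/3}}{(B_r+B_p+1)^{2/3}}, \tag{9}$$

$$\eta = \eta^* = \frac{(B_p+1)W^*}{2T} = \frac{S^{2/3}A^{1/3}(B_p+1)}{2\,T^{1/3}(B_r+B_p+1)^{2/3}}, \tag{10}$$

this is

$$\tilde{O}\left(D_{\max}(B_r+B_p+1)^{1/3}S^{2/3}A^{1/3}T^{2/3}\right).$$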
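As a quick numerical check of how the tuned choices (9) and (10) balance the terms of (8), the sketch below evaluates each bracketed term without the $D_{\max}$ factor and the logarithmic terms hidden by $\tilde{O}$. The function names are illustrative assumptions, and in practice $W$ would also need to be rounded to an integer no larger than $T$, which is not done here.

```python
def tuned_parameters(S, A, T, B_r, B_p):
    """Window length W* from (9) and widening parameter eta* from (10)."""
    W = S ** (2 / 3) * A ** (1 / 3) * T ** (2 / 3) / (B_r + B_p + 1) ** (2 / 3)
    eta = (B_p + 1) * W / (2 * T)
    return W, eta


def bracket_terms(S, A, T, B_r, B_p, W, eta):
    """Individual terms of (8), ignoring D_max and logarithmic factors."""
    return {
        "B_r W": B_r * W,
        "B_p W": B_p * W,
        "S sqrt(A) T / sqrt(W)": S * A ** 0.5 * T / W ** 0.5,
        "T eta": T * eta,
        "S A T / W": S * A * T / W,
        "sqrt(T)": T ** 0.5,
    }
```

With $W = W^*$, the term $S\sqrt{A}\,T/\sqrt{W}$ equals $(B_r+B_p+1)^{1/3}S^{2/3}A^{1/3}T^{2/3}$ exactly, while $B_r W^*$, $B_p W^*$ and $T\eta^*$ are each at most that quantity.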

The complete proof of Theorem 4.3 is presented in Section D of the appendix.

###### Remark

When $S = 1$, our problem model specializes to the non-stationary bandit problem studied by (Besbes et al. 2014). In this case, we have , and we are left with the first term in (8). By choosing , our algorithm has dynamic regret , matching the minimax optimal dynamic regret bound by (Besbes et al. 2014).

###### Remark

Different from (Cheung et al. 2019), there is no straightforward way of setting to get a non-trivial bound when are not known. While (Cheung et al. 2019) provide a way to set their window size (oblivious of their variational budget ) so that the dynamic regret is in their linear bandit setting, in our setting we still need to have , which is a priori not clear how to ensure when we don’t know .

### 4.4 Uniform Variation Budget, Confidence Widening, and Alternatives

We now pause for a while to comment on Assumption 4.3 and the technique of confidence widening.

In online stochastic environments, one usually takes the time average of observed samples to estimate a latent quantity, even when the sample distributions vary with time. This has been proved to work well in non-stationary bandit settings (Garivier and Moulines 2011, Cheung et al. 2019). For online MDPs, one typically looks at the time average MDP $\hat{p}_t$ in (3), which estimates $\bar{p}_t$ in (4) to within an additive error for any pair of $(s,a)$. In the case of stationary MDPs where , one has and thus can conclude that the un-widened confidence region contains with high probability. An immediate consequence is that the EVI w.r.t. would return a policy with bounded difference bias vector as has diameter (please see Section 4.3 of (Jaksch et al. 2010)). These further ensure that the optimistic long term average reward is not far away from the true long term average reward, e.g., step 2 in the proof of Proposition 4.3.

Nevertheless, this is not always the case under non-stationarity. Although (Jaksch et al. 2010, Gajane et al. 2018) use for the piecewise stationary MDP setting, they crucially exploit the fact that the MDP remains unchanged between jumps, and can treat the problem as if it were stationary. For changing environments, one can only guarantee that the transition kernel with high probability, but cannot be sure about the true ’s due to the drift. In Section E.1 of the appendix, we show that the diameter of can grow as and the EVI w.r.t. can only promise a policy with bias vector bounded by which makes the dynamic regret bound vacuous for drifting environments. By Assumption 4.3 and the confidence widening technique, we are guaranteed that for each episode and can proceed as before.

Alternatively, if there exists such that for any states there is always an action with then the MDP has diameter and it can be shown that the SWUCRL2-CW algorithm with enjoys a dynamic regret bound by similar techniques. As we shall show in Section E.2, this assumption can be easily satisfied in many realistic applications.

## 5 Bandit-over-Reinforcement Learning: Towards Parameter-Free

As pointed out by Remark 4.3, in the case of unknown $B_r$ and $B_p$, the DM cannot implement the SWUCRL2-CW algorithm, as the magnitude of confidence widening cannot be determined. To handle this case, we wish to design an online algorithm that attains a reasonable dynamic regret bound in a parameter-free manner. By Theorem 4.3, we are assured that, under Assumption 4.3, a fixed pair of parameters $(W^*, \eta^*)$ can ensure low regret. For the bandit setting, (Cheung et al. 2019) propose the bandit-over-bandit framework that uses a separate copy of the EXP3 algorithm to tune the window length. Inspired by it, we develop a novel Bandit-over-Reinforcement Learning (BORL) algorithm with parameter-free dynamic regret here.

### 5.1 Design Overview

Following a similar line of reasoning as (Cheung et al. 2019), we make use of the SWUCRL2-CW algorithm as a sub-routine, and “hedge” (Bubeck and Cesa-Bianchi 2012) against the (possibly adversarial) changes of $r_t$’s and $p_t$’s to identify a reasonable fixed window length and confidence widening parameter.

As illustrated in Fig. 1, the BORL algorithm divides the whole time horizon into $\lceil T/H\rceil$ blocks of equal length $H$ rounds (the last block can be shorter), and specifies a set $M$ from which each pair of (window length, confidence widening) parameters is drawn. For each block $i$, the BORL algorithm first calls some master algorithm to select a pair of (window length, confidence widening) parameters $(W_i, \eta_i)$, and restarts the SWUCRL2-CW algorithm with the selected parameters as a sub-routine to choose actions for this block. Afterwards, the total reward of block $i$ is fed back to the master, and the “posterior” of these parameters is updated accordingly.

One immediate challenge not present in the bandit setting (Cheung et al. 2019) is that the starting state of each block is determined by previous moves of the DM. Hence, the master algorithm is not facing a simple oblivious environment as in the bandit setting, where there is only one state. Fortunately, the state is observed before the start of each block. To this end, we use the EXP3.P algorithm for multi-armed bandit against an adaptive adversary (Auer et al. 2002, Bubeck and Cesa-Bianchi 2012) as the master algorithm.
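The block structure described above can be sketched in a few lines. The helper name `block_ranges` is an illustrative assumption; rounds are 1-indexed, so block $i$ covers rounds $(i-1)H+1$ through $\min(iH, T)$.

```python
def block_ranges(T, H):
    """Split rounds 1..T into ceil(T / H) consecutive blocks of length H.

    Returns a list of (first_round, last_round) pairs; the last block may be shorter.
    """
    n_blocks = (T + H - 1) // H  # integer ceiling of T / H
    return [((i - 1) * H + 1, min(i * H, T)) for i in range(1, n_blocks + 1)]
```

For instance, a horizon of 10 rounds with blocks of length 4 gives three blocks, the last of which covers only rounds 9 and 10.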

### 5.2 Design Details

We are now ready to state the details of the BORL algorithm. For some fixed choice of block length (to be determined later), we first define a couple of additional notations:

 (11)

Here, and are all possible choices of window length and confidence widening parameter, respectively, and is the Cartesian product of them with We emphasize that due to the restarting, any instance of the SWUCRL2-CW algorithm cannot last for more than rounds. Consequently, even if the EXP3.P selects a window length the effective window length is and we make We also let be the total rewards for running the SWUCRL2-CW algorithm with window length and confidence widening parameter for rounds starting from state

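The EXP3.P algorithm (Bubeck and Cesa-Bianchi 2012) treats each element of $M$ as an arm. It begins by initializing

$$\alpha = 0.95\sqrt{\frac{\ln\Delta}{\Delta\lceil T/H\rceil}}, \quad \beta = \sqrt{\frac{\ln\Delta}{\Delta\lceil T/H\rceil}}, \quad \gamma = 1.05\sqrt{\frac{\Delta\ln\Delta}{\lceil T/H\rceil}}, \quad q_{(j,k),1} = 0 \ \ \forall (j,k)\in M, \tag{12}$$

where . At the beginning of each block $i$, the BORL algorithm first sees the state $s_{(i-1)H+1}$ and computes

$$\forall (j,k)\in M, \quad u_{(j,k),i} = (1-\gamma)\frac{\exp(\alpha q_{(j,k),i})}{\sum_{(j',k')\in M}\exp(\alpha q_{(j',k'),i})} + \frac{\gamma}{\Delta}. \tag{13}$$

Then it sets $(j_i, k_i) = (j,k)$ with probability $u_{(j,k),i}$ for each $(j,k)\in M$. The selected pair of parameters are thus $W_i$ and $\eta_i$. Afterwards, the BORL algorithm starts from state $s_{(i-1)H+1}$ and selects actions by running the SWUCRL2-CW algorithm with window length $W_i$ and confidence widening parameter $\eta_i$ for each round in block $i$. At the end of the block, the BORL algorithm observes the total rewards $R(W_i, \eta_i, s_{(i-1)H+1})$. As a last step, it rescales $R(W_i, \eta_i, s_{(i-1)H+1})$ by dividing it by $H$ so that it is within $[0,1]$, and updates

$$\forall (j,k)\in M, \quad q_{(j,k),i+1} = q_{(j,k),i} + \frac{\beta + \mathbf{1}_{(j,k)=(j_i,k_i)}\cdot R(W_i,\eta_i,s_{(i-1)H+1})/H}{u_{(j,k),i}}. \tag{14}$$

The formal description of the BORL algorithm (with $H$ defined in the next subsection) is shown in Algorithm 2.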
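The master updates (12) to (14) can be sketched as a small class. The class and method names are illustrative assumptions, arms are indexed by integers rather than pairs $(j,k)$, and two implementation details are not from the paper: $\gamma$ is clipped at 1 so that the probabilities stay valid when there are few blocks, and the largest score is subtracted inside the exponentials for numerical stability (which leaves the ratios in (13) unchanged).

```python
import math
import random


class Exp3P:
    """EXP3.P master over n_arms parameter pairs, run for n_blocks blocks."""

    def __init__(self, n_arms, n_blocks, seed=0):
        self.n_arms = n_arms
        log_arms = math.log(n_arms)
        # Initialization as in (12); gamma clipped at 1 (an assumption of this sketch).
        self.alpha = 0.95 * math.sqrt(log_arms / (n_arms * n_blocks))
        self.beta = math.sqrt(log_arms / (n_arms * n_blocks))
        self.gamma = min(1.0, 1.05 * math.sqrt(n_arms * log_arms / n_blocks))
        self.q = [0.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        """Sampling distribution (13): exponential weights mixed with uniform exploration."""
        top = max(self.q)
        weights = [math.exp(self.alpha * (x - top)) for x in self.q]
        total = sum(weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms for w in weights]

    def select(self):
        """Draw an arm; also return the distribution needed by the update."""
        u = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=u)[0]
        return arm, u

    def update(self, arm, scaled_reward, u):
        """Update (14); scaled_reward is the block reward divided by H, assumed in [0, 1]."""
        for k in range(self.n_arms):
            gain = scaled_reward if k == arm else 0.0
            self.q[k] += (self.beta + gain) / u[k]
```

In a BORL-style loop, one would call `select()` at the start of each block, run the sub-routine with the chosen parameters, and then call `update()` with the block's total reward divided by $H$.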

### 5.3 Performance Analysis

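To analyze the performance of the BORL algorithm, we consider the following regret decomposition. For any choice of $(W^\dagger, \eta^\dagger)$, we have

$$\text{Dyn-Reg}_T(\text{BORL}) = \sum_{i=1}^{\lceil T/H\rceil}\mathbb{E}\left[\sum_{t=(i-1)H+1}^{i\cdot H\wedge T}\rho^*_t - R\left(W^\dagger,\eta^\dagger,s_{(i-1)H+1}\right)\right] + \mathbb{E}\left[\sum_{i=1}^{\lceil T/H\rceil}\left(R\left(W^\dagger,\eta^\dagger,s_{(i-1)H+1}\right) - R\left(W_i,\eta_i,s_{(i-1)H+1}\right)\right)\right]. \tag{15}$$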

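For the first term in eq. (15), we can apply the results from Theorem 4.3 to each block $i$, i.e.,

$$\sum_{t=(i-1)H+1}^{i\cdot H\wedge T}\rho^*_t - R\left(W^\dagger,\eta^\dagger,s_{(i-1)H+1}\right) = \tilde{O}\left(B_r(i)W^\dagger + D_{\max}\left[B_p(i)W^\dagger + \frac{S\sqrt{A}\,H}{\sqrt{W^\dagger}} + H\eta^\dagger + \frac{SAH}{W^\dagger} + \sqrt{H}\right]\right), \tag{16}$$

where we have defined

$$B_r(i) := \sum_{t=(i-1)H+1}^{i\cdot H\wedge T} B_{r,t}, \qquad B_p(i) := \sum_{t=(i-1)H+1}^{i\cdot H\wedge T} B_{p,t}$$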

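for brevity. The second term captures the additional rewards the DM would have collected had it used the fixed parameters $(W^\dagger, \eta^\dagger)$ throughout, w.r.t. the trajectory of the starting states of each block under the BORL algorithm, i.e., $s_{(i-1)H+1}$ for each block $i$. This is exactly the regret of the EXP3.P algorithm when it is applied to a $\Delta$-arm adaptive adversarial bandit problem with rewards from $[0, H]$. Therefore, for any choice of $(W^\dagger, \eta^\dagger)$, we can upper bound this by

$$\tilde{O}\left(H\sqrt{\Delta T/H}\right) = \tilde{O}\left(\sqrt{TH}\right)$$

as . Summing these two, the regret of the BORL algorithm is

$$\text{Dyn-Reg}_T(\text{BORL}) = \tilde{O}\left(B_r W^\dagger + D_{\max}\left[B_p W^\dagger + \frac{S\sqrt{A}\,T}{\sqrt{W^\dagger}} + T\eta^\dagger + \frac{SAT}{W^\dagger} + \sqrt{TH}\right]\right). \tag{17}$$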

We now point out a key trade-off in the choice of $H$:

• On one hand, $H$ should be small enough so that the regret bound in eq. (17) is small.

• On the other hand, $H$ should be large to allow $W^\dagger$ to get close to $W^*$ even when $B_r + B_p$ is small.

To this end, we pick

$$H = \left\lceil S\sqrt{AT}\right\rceil.$$
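As a short worked calculation (ignoring the ceiling, and using the expression for $W^*$ in (9)), this choice of $H$ turns the master's regret term $\sqrt{TH}$ into

$$\sqrt{TH} = \sqrt{T\cdot S\sqrt{AT}} = S^{1/2}A^{1/4}T^{3/4},$$

while the constraint that a window cannot exceed the block length, $W^* \le H$, reads

$$\frac{S^{2/3}A^{1/3}T^{2/3}}{(B_r+B_p+1)^{2/3}} \le S A^{1/2} T^{1/2} \quad\Longleftrightarrow\quad B_r + B_p + 1 \ \ge\ \frac{T^{1/4}}{S^{1/2}A^{1/4}}.$$

This is only a sanity check of the orders involved, not a substitute for the formal argument in the appendix.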

We also justify the choice of $H$ and formalize the dynamic regret bound of the BORL algorithm as follows.

###### Theorem

Assume that , the dynamic regret bound of the BORL algorithm is

$$\tilde{O}\left(D_{\max}(B_r+B_p+1)^{1/3}S^{2/3}A^{1/3}T^{2/3} + D_{\max}S^{1/2}A^{1/4}T^{3/4}\right)$$

with probability . The complete proof can be found in Section F of the Appendix.

## References

• Abbasi-Yadkori et al. (2013) Abbasi-Yadkori, Yasin, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvári. 2013. Online learning in markov decision processes with adversarially chosen transition probability distributions. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS).
• Agrawal and Jia (2017) Agrawal, Shipra, Randy Jia. 2017. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 1184–1194.
• Auer et al. (2002) Auer, P., N. Cesa-Bianchi, Y. Freund, R. Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002, Vol. 32, No. 1 : pp. 48–77.
• Bartlett and Tewari (2009) Bartlett, Peter L., Ambuj Tewari. 2009. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating mdps. UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009. 35–42.
• Bertsekas (2017) Bertsekas, Dimitri. 2017. Dynamic Programming and Optimal Control. Athena Scientific.
• Besbes et al. (2014) Besbes, Omar, Yonatan Gur, Assaf Zeevi. 2014. Stochastic multi-armed bandit with non-stationary rewards. Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS).
• Bubeck and Cesa-Bianchi (2012) Bubeck, S., N. Cesa-Bianchi. 2012. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 2012, Vol. 5, No. 1: pp. 1–122.
• Chen et al. (2019) Chen, Yifang, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei. 2019. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. Proceedings of Conference on Learning Theory (COLT).
• Cheung et al. (2019) Cheung, Wang Chi, David Simchi-Levi, Ruihao Zhu. 2019. Learning to optimize under non-stationarity. Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS).
• Dick et al. (2014) Dick, Travis, András György, Csaba Szepesvári. 2014. Online learning in markov decision processes with changing cost sequences. Proceedings of the International Conference on Machine Learning (ICML).
• Even-Dar et al. (2005) Even-Dar, Eyal, Sham M Kakade, Yishay Mansour. 2005. Experts in a markov decision process. Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS).
• Fruit et al. (2018a) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric. 2018a. Near optimal exploration-exploitation in non-communicating markov decision processes. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett, eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2998–3008.
• Fruit et al. (2018b) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner. 2018b. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. Jennifer Dy, Andreas Krause, eds., Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 1578–1586.
• Gajane et al. (2018) Gajane, Pratik, Ronald Ortner, Peter Auer. 2018. A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions. CoRR abs/1805.10066.
• Garivier and Moulines (2011) Garivier, A., E. Moulines. 2011. On upper-confidence bound policies for switching bandit problems. Proceedings of International Conference on Algorithmic Learning Theory (ALT).
• Hoeffding (1963) Hoeffding, Wassily. 1963. Probability inequalities for sums of bounded random variables. Journal of the American statistical association 58(301) 13–30.
• Jadbabaie et al. (2015) Jadbabaie, Ali, Alexander Rakhlin, Shahin Shahrampour, Karthik Sridharan. 2015. Online optimization : Competing with dynamic comparators. Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS).
• Jaksch et al. (2010) Jaksch, Thomas, Ronald Ortner, Peter Auer. 2010. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 11 1563–1600.
• Karnin and Anava (2016) Karnin, Z., O. Anava. 2016. Multi-armed bandits: Competing with optimal sequences. Proceedings of Annual Conference on Neural Information Processing Systems (NIPS).
• Keskin and Zeevi (2016) Keskin, N., A. Zeevi. 2016. Chasing demand: Learning and earning in a changing environment. Mathematics of Operations Research, 2016, 42(2), 277–307.
• Lattimore and Szepesvári (2018) Lattimore, T., C. Szepesvári. 2018. Bandit Algorithms. Cambridge University Press.
• Luo et al. (2018) Luo, H., C. Wei, A. Agarwal, J. Langford. 2018. Efficient contextual bandits in non-stationary worlds. Proceedings of Conference on Learning Theory (COLT).
• Nilim and Ghaoui (2005) Nilim, Arnab, Laurent El Ghaoui. 2005. Robust control of markov decision processes with uncertain transition matrices. Operations Research.
• Qin et al. (2019) Qin, Zhiwei (Tony), Jian Tang, Jieping Ye. 2019. Deep reinforcement learning with applications in transportation. Tutorial of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19).
• Weissman et al. (2003) Weissman, Tsachy, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, Marco L. Weinberger. 2003. Inequalities for the l1 deviation of the empirical distribution. Technical Report HPL-2003-97, HP Laboratories Palo Alto: www.hpl.hp.com/techreports/2003/HPL-2003-97R1.
• Xu and Mannor (2006) Xu, Huan, Shie Mannor. 2006. The robustness-performance tradeoff in markov decision processes. Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS).
• Yu and Mannor (2009) Yu, Jia Yuan, Shie Mannor. 2009. Online learning in markov decision processes with arbitrarily changing rewards and transitions. Proceedings of the International Conference on Game Theory for Networks.
• Zhang and Wang (2018) Zhang, Anru, Mengdi Wang. 2018. Spectral state compression of markov processes.

Supplementary

## Appendix A Supplementary Details about MDPs

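The optimal long term reward $\rho^*_t$ is equal to the optimal value of the linear program $P(r_t, p_t)$. For a reward vector $r$ and a transition kernel $p$, we define

$$\begin{aligned} P(r,p):\quad \max\ & \sum_{s\in\mathcal{S}, a\in\mathcal{A}_s} r(s,a)\,x(s,a) \\ \text{s.t.}\ & \sum_{a\in\mathcal{A}_s} x(s,a) = \sum_{s'\in\mathcal{S}, a'\in\mathcal{A}_{s'}} p(s|s',a')\,x(s',a') && \forall s\in\mathcal{S} \\ & \sum_{s\in\mathcal{S}, a\in\mathcal{A}_s} x(s,a) = 1 \\ & x(s,a)\ge 0 && \forall s\in\mathcal{S}, a\in\mathcal{A}_s \end{aligned} \tag{18}$$

Throughout our analysis, it is useful to consider the following dual formulation of the optimization problem $P(r,p)$:

$$\begin{aligned} D(r,p):\quad \min\ & \rho \\ \text{s.t.}\ & \rho + \gamma(s) \ge r(s,a) + \sum_{s'\in\mathcal{S}} p(s'|s,a)\,\gamma(s') && \forall s\in\mathcal{S}, a\in\mathcal{A}_s \\ & \rho,\ \gamma(s)\ \text{free} && \forall s\in\mathcal{S}. \end{aligned} \tag{19}$$

The following lemma shows that any feasible solution to $D(r,p)$ is essentially bounded if the underlying MDP is communicating, which will be crucial in the subsequent analysis.

###### Lemma

Let $(\rho, \gamma)$ be a feasible solution to the dual problem $D(r,p)$, where $(\mathcal{S},\mathcal{A},p)$ constitutes a communicating MDP with diameter $D$. We have

$$\max_{s,s'\in\mathcal{S}}\left\{\gamma(s) - \gamma(s')\right\} \le 2D.$$

The lemma is extracted from Section 4.3.1 of (Jaksch et al. 2010), and it is more general than (Lattimore and Szepesvári 2018), which requires $(\rho,\gamma)$ to be optimal instead of just feasible.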
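To illustrate the dual formulation (19), the sketch below checks feasibility of a candidate pair $(\rho, \gamma)$ by computing the smallest slack over all constraints. The function name and the nested-list layout (`r[s][a]`, `p[s][a][s2]`) are illustrative assumptions. By weak duality, any feasible $\rho$ upper bounds the optimal value of the primal problem (18).

```python
def dual_slack(rho, gamma, r, p):
    """Smallest slack of the constraints in (19); non-negative iff (rho, gamma) is feasible."""
    n = len(gamma)
    return min(
        rho + gamma[s] - r[s][a] - sum(p[s][a][s2] * gamma[s2] for s2 in range(n))
        for s in range(len(r))
        for a in range(len(r[s]))
    )
```

For example, in a two-state MDP where staying in state 1 earns reward 1 and staying in state 0 earns reward 0.2, the pair $\rho = 1$, $\gamma = (0, 1)$ has zero minimum slack, whereas lowering $\rho$ to $0.5$ violates the constraint for staying in state 1.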

### A.1 Extended Value Iteration by (Jaksch et al. 2010)

We provide the pseudocode of the extended value iteration proposed by (Jaksch et al. 2010), displayed in Algorithm 3. By (Jaksch et al. 2010), the algorithm converges in finite time when