Cheung, Simchi-Levi, and Zhu
Reinforcement Learning under Drift
Wang Chi Cheung, Department of Industrial Systems Engineering and Management, National University of Singapore, isecwc@nus.edu.sg
David Simchi-Levi, Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu
Ruihao Zhu, Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, MA 02139, rzhu@mit.edu
We propose algorithms with state-of-the-art dynamic regret bounds for undiscounted reinforcement learning under drifting nonstationarity, where both the reward functions and state transition distributions are allowed to evolve over time. Our main contributions are: 1) A tuned Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence-Widening (SWUCRL2CW) algorithm, which attains low dynamic regret bounds against the optimal nonstationary policy in various cases. 2) The Bandit-over-Reinforcement Learning (BORL) framework that further permits us to enjoy these dynamic regret bounds in a parameter-free manner.
nonstationary reinforcement learning, Markov decision process, parameter-free algorithm
1 Introduction
Consider a discrete-time Markov decision process (MDP) where a decision-maker (DM) interacts with a system iteratively: in each round, the DM first observes the current state of the system, and then picks an available action. Afterwards, it receives an instantaneous random reward, and the system transits to the next state according to some state transition distribution. The reward distribution and the state transition distribution depend on the current state and the chosen action, but are independent of all the previous states and actions. The goal of the DM is to maximize its cumulative rewards under the following challenges:

Uncertainty: the reward and the state transition distributions are initially unknown to the DM.

Nonstationarity: the environment is nonstationary, and both the reward distributions and the state transition distributions can evolve over time.

Partial feedback: in each round, the DM can only observe the reward and state transition resulting from the current state and the chosen action.
In fact, many applications, such as inventory control (Bertsekas 2017) and transportation (Zhang and Wang 2018, Qin et al. 2019), can be modeled by this general framework.
Under stationarity, this problem can be solved by the classical Upper-Confidence bound for Reinforcement Learning (UCRL2) algorithm (Jaksch et al. 2010). Unfortunately, strategies designed for the stationary setting can deteriorate in nonstationary environments as historical data “expires”. To address this shortcoming, Jaksch et al. (2010) and Gajane et al. (2018) further consider a switching MDP setting where the MDP is piecewise-stationary, and propose solutions for it. Under the special case of multi-armed bandit (MAB), where the MDP has only one state, there is a recent stream of research initiated by Besbes et al. (2014) that studies the so-called drifting environment (Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019, Chen et al. 2019), in which the reward of each action can change arbitrarily over time, but the total change (quantified by a suitable metric) is upper bounded by some variation budget (Besbes et al. 2014). The aim is to minimize the dynamic regret, the optimality gap compared to the cumulative rewards of the sequence of optimal actions.
In this paper, we generalize the concept of “drift” from MAB to RL, i.e., the reward and state transition distributions can shift gradually as long as their rounds total changes are bounded by and respectively. We then design and analyze novel algorithms for RL in a drifting environment. Let and be the number of states and actions for the MDP, our main contributions are listed as follows.

When the variation budgets are known, we develop a Sliding Window UCRL2 with Confidence Widening (SWUCRL2CW) algorithm with dynamic regret bound under the time-invariant variation budget assumption that is similar to, but less restrictive than, that of Yu and Mannor (2009).

We identify a unique challenge in RL under drift: existing works for switching MDP settings (Jaksch et al. 2010, Gajane et al. 2018) estimate unknown parameters by averaging historical data in a “forgetting” fashion, and crucially exploit the piecewise-stationary environment to achieve low regret. In general nonstationary settings, however, the diameter (a complexity measure to be defined in Section 3) of the MDP estimated in this manner can grow wildly, and may result in unfavorable dynamic regret bounds for drifting environments due to the lack of piecewise stationarity. This is further discussed in Section 4.4 and Section E. We overcome this with our novel confidence widening technique when the variations are uniformly bounded over time. We also show that one can bypass this difficulty in many realistic cases stemming from inventory control or queuing systems under a mild assumption.

When the variation budgets are unknown, we propose the Bandit-over-Reinforcement Learning (BORL) framework that tunes the SWUCRL2CW algorithm adaptively, and hence enjoys a parameter-free dynamic regret bound.
2 Related Works
Learning undiscounted MDPs under stationarity has been studied in (Bartlett and Tewari 2009, Jaksch et al. 2010, Agrawal and Jia 2017, Fruit et al. 2018a, b) among others. For nonstationary MDPs, the stream of works (Even-Dar et al. 2005, Nilim and Ghaoui 2005, Xu and Mannor 2006, Dick et al. 2014) considers settings with either changing reward functions or changing transition kernels, but not both. In contrast, Yu and Mannor (2009) allow arbitrary changes in rewards, but (globally) bounded changes in the transition kernels, and design algorithms under additional Markov chain mixing assumptions. Jaksch et al. (2010) and Gajane et al. (2018) propose solutions for the switching setting. Abbasi-Yadkori et al. (2013) consider learning MDPs in an adversarial environment with full information feedback.
For online learning and bandit problems where there is only one state, the works by (Auer et al. 2002, Garivier and Moulines 2011, Besbes et al. 2014, Keskin and Zeevi 2016) propose several “forgetting” strategies for different settings. More recently, the works by (Jadbabaie et al. 2015, Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019, Chen et al. 2019) design parameter-free algorithms for nonstationary online learning.
3 Problem Formulation
An instance of a nonstationary online MDP is specified by the tuple . The set is a finite set of states. The collection contains a finite action set for each state . The quantity is the total number of time steps. The quantity is a sequence of mean rewards. For each , we have , where for each . The quantity is a sequence of transition kernels. For each , we have , where is a probability distribution over for each . The quantities vary across different ’s in general. We elaborate on their temporal variations in Section 4 when we provide the performance guarantee of our algorithm according to the degree of nonstationarity.
Dynamics. We consider a DM who faces an online nonstationary MDP instance . The DM knows , but not . It starts at an arbitrary state . At time , three events happen. First, the DM observes its current state . Second, it takes an action . Third, given , it stochastically transits to another state which is distributed as , and receives a stochastic reward , which is 1-sub-Gaussian with mean . In the second event, the choice of is based on a non-anticipatory policy . That is, the choice only depends on the current state and the previous observations .
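The three events of each round can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the names `step` and `run`, the list-based encoding of rewards and kernels, and the choice of unit-variance Gaussian noise (one example of 1-sub-Gaussian noise) are illustrative and not part of the original formulation.

```python
import random

def step(state, action, mean_reward, kernel, rng):
    """Simulate one round: draw the next state and a noisy reward.

    mean_reward[state][action] is the mean reward at the current round;
    kernel[state][action] is a list of next-state probabilities.
    """
    probs = kernel[state][action]
    next_state = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Unit-variance Gaussian noise is 1-sub-Gaussian (illustrative choice).
    reward = rng.gauss(mean_reward[state][action], 1.0)
    return next_state, reward

def run(policy, mean_rewards, kernels, start_state, seed=0):
    """Roll out a non-anticipatory policy for len(mean_rewards) rounds.

    mean_rewards[t] and kernels[t] may differ across rounds t (drift).
    """
    rng = random.Random(seed)
    state, history, total = start_state, [], 0.0
    for t in range(len(mean_rewards)):
        # The policy sees only the current state and past observations.
        action = policy(state, history)
        next_state, reward = step(state, action, mean_rewards[t], kernels[t], rng)
        history.append((state, action, reward))
        total += reward
        state = next_state
    return total, history
```

Because `policy` receives only the current state and the recorded history, it cannot depend on future rewards or kernels, which mirrors the non-anticipatory requirement.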
Objective. The DM aims to maximize the cumulative expected reward , despite the model uncertainty on ’s and the nonstationarity of the learning environment. To measure the convergence of an online algorithm to optimality, we consider an equivalent objective of dynamic regret minimization. For each time , let be the optimum long term reward for the online MDP problem with stationary transition kernel and stationary mean reward function To compute we refer to the linear programming formulation (18) in Section A. The DM aims to design policy to minimize the dynamic regret
We note that this regret notion recovers the dynamic regret definition of the bandit setting when specialized to bandits. Unlike in the bandit setting, the expected total reward received could be much higher than the benchmark, as the latter does not take the starting state into account. We also review the relevant concepts of communicating MDPs and their diameters.
Definition (Hitting Time, Communicating MDPs and Diameters). Consider a set of states , a collection of action sets, and a transition kernel . For any and any stationary policy , the hitting time from to under is the random variable
which can be infinite. We say that constitutes a communicating MDP if and only if
is finite. The quantity is the diameter associated with . To enable learning in a nonstationary environment, we make the following reachability assumption.
Assumption. For time , the tuple constitutes a communicating MDP with diameter at most . We denote .
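For a small, fully known kernel, the diameter can be approximated numerically. The sketch below assumes the standard definition from Jaksch et al. (2010), namely the maximum over ordered pairs of distinct states of the minimal expected hitting time, and assumes the MDP is communicating so that the fixed-point iteration converges. The function names and the nested-list encoding `kernel[s][a][s2]` are hypothetical.

```python
def min_hitting_times(kernel, target, iters=10_000, tol=1e-10):
    """Minimal expected number of steps to reach `target` from each state.

    kernel[s][a][s2] is the probability of moving from s to s2 under action a.
    Iterates h(s) = 1 + min_a sum_s2 p(s2 | s, a) h(s2), with h(target) = 0.
    """
    n = len(kernel)
    h = [0.0] * n
    for _ in range(iters):
        new_h = [0.0] * n
        for s in range(n):
            if s == target:
                continue
            new_h[s] = 1.0 + min(
                sum(p * h[s2] for s2, p in enumerate(row)) for row in kernel[s]
            )
        if max(abs(x - y) for x, y in zip(new_h, h)) < tol:
            return new_h
        h = new_h
    return h

def diameter(kernel):
    """Largest minimal expected hitting time over ordered pairs of distinct states."""
    n = len(kernel)
    worst = 0.0
    for target in range(n):
        h = min_hitting_times(kernel, target)
        worst = max(worst, max((h[s] for s in range(n) if s != target), default=0.0))
    return worst
```

For instance, a three-state ring in which the single action always moves one step clockwise needs two steps to reach the farthest state; adding a counter-clockwise action makes every other state reachable in one step.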
4 Sliding Window UCRL2 with Confidence Widening
In this section, we present the SWUCRL2CW algorithm, which incorporates sliding window estimates (Garivier and Moulines 2011) and a novel confidence widening technique into the UCRL2 algorithm (Jaksch et al. 2010).
4.1 Design Overview
The SWUCRL2CW algorithm first specifies a window length and runs in a sequence of episodes that partitions the time steps. Episode starts at round (in particular ), and ends at the end of round . Throughout an episode the DM follows a certain stationary policy It ceases the episode if at least one of the following two criteria is met:

The round index is a multiple of This ensures that each episode is at most time steps long, and prevents choosing actions based on out-of-date information.

There exists some (state, action) pair such that the number of visits to it within episode is at least as many as the total number of visits to it within the rounds prior to i.e., rounds This is similar to the doubling criterion in (Jaksch et al. 2010), which ensures that each episode is sufficiently long so that the DM can focus on learning.
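The two stopping criteria above can be sketched as a single check performed after each round. This is an illustrative sketch only: the function name, the dictionary-based visit counters, and the floor of 1 on the prior count (borrowed from the usual UCRL2 convention) are assumptions rather than details stated in the text.

```python
def episode_should_end(t, window, visits_in_episode, visits_before_episode):
    """Return True if either stopping criterion holds after round t.

    visits_in_episode[(s, a)]: visits to (s, a) since the episode began.
    visits_before_episode[(s, a)]: visits to (s, a) in the `window` rounds
    preceding the start of the episode.
    """
    # Criterion 1: the round index is a multiple of the window length.
    if t % window == 0:
        return True
    # Criterion 2: some pair was visited at least as often in this episode
    # as in the window before it.
    for pair, count in visits_in_episode.items():
        prior = max(1, visits_before_episode.get(pair, 0))  # floor of 1 is an assumption
        if count >= prior:
            return True
    return False
```

The first check caps the episode length by the window length, while the second mirrors the doubling criterion restricted to the sliding window.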
The combined effect of these two criteria allows the DM to learn a near-optimal policy with historical data from an appropriately sized time window. One important ingredient is the construction of the policy for each episode. To allow learning under nonstationarity, the SWUCRL2CW algorithm computes the policy for the episode based on the history in the previous rounds before the current episode i.e., between round and . The construction of involves the Extended Value Iteration (EVI) (Jaksch et al. 2010), which requires the confidence/uncertainty regions for rewards and transition kernels as the inputs, in addition to an error parameter . We note that the parameter is a certain confidence widening parameter, which helps ensure that the output MDP of the EVI has a bounded diameter.
4.2 Policy Construction
For ease of discussion, we first define for each state-action pair and each time that belongs to episode
(1) 
1) Confidence Region for Rewards. For each state-action pair and each time that belongs to episode , we consider the empirical mean estimator
which has mean
The confidence region is with
(2) 
where .
2) Confidence Region and Confidence Widening for Transition Kernels. For each and each time in episode , we also consider the empirical mean estimator
(3) 
which has mean
(4) 
The confidence region is with
(5) 
where .
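The sliding-window estimators of this subsection can be sketched as follows. The empirical means follow the description in the text (averaging over the window preceding the episode), but the confidence radii shown here are placeholders of the usual square-root form, since the exact constants are not reproduced in this sketch. The function name, the tuple layout of `history`, and the parameters `eta` and `delta` are hypothetical labels.

```python
import math

def window_estimates(history, t_start, window, n_states, s, a, eta=0.0, delta=0.05):
    """Sliding-window estimates for the pair (s, a).

    history[q] = (state, action, reward, next_state) for round q (0-indexed).
    Only rounds in [max(0, t_start - window), t_start) are used.
    Returns (count, r_hat, r_radius, p_hat, p_radius).
    """
    lo = max(0, t_start - window)
    samples = [(r, s2) for (x, y, r, s2) in history[lo:t_start] if (x, y) == (s, a)]
    count = len(samples)
    n_plus = max(1, count)  # avoids division by zero when the pair is unvisited
    r_hat = sum(r for r, _ in samples) / n_plus
    p_hat = [sum(1 for _, s2 in samples if s2 == j) / n_plus for j in range(n_states)]
    # Placeholder radii of the form sqrt(log(1/delta) / N); constants are assumptions.
    r_radius = math.sqrt(2.0 * math.log(1.0 / delta) / n_plus)
    # The transition radius is enlarged by the widening parameter eta.
    p_radius = math.sqrt(2.0 * n_states * math.log(1.0 / delta) / n_plus) + eta
    return count, r_hat, r_radius, p_hat, p_radius
```

Setting `eta` to zero gives an unwidened region; a positive `eta` enlarges only the transition-kernel region, which is the role the text assigns to confidence widening.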
3) Extended Value Iteration (EVI) (Jaksch et al. 2010). The SWUCRL2CW algorithm relies on the EVI, which solves MDPs with optimistic exploration to near-optimality. We extract (and rephrase) a description of EVI in Appendix A.1. EVI inputs the confidence regions for the rewards and the transition kernels. It then outputs an “optimistic MDP model”, which consists of reward vector and transition kernel under which the optimal average gain is the largest among all reward vectors and transition kernels in the supplied confidence regions. Specifically, it works as follows:

Input: confidence regions for , for and an arbitrarily small error parameter

Output: The returned policy and the auxiliary output Here, and are the selected “optimistic” reward vector, transition kernel, and the corresponding long term average reward, while is the bias vector (Jaksch et al. 2010). Collectively, we express . Note that but for some .
The output of the EVI further satisfies the following two properties.
Property 1
The dual variables are optimistic, i.e.,
Property 2
For each state , we have
Consequently, is an optimal (up to an additive factor of ) deterministic policy for MDP
With these, the formal description of the SWUCRL2CW algorithm is shown in Algorithm 1.
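The EVI subroutine described in this subsection can be sketched in a few lines, following the general scheme of Jaksch et al. (2010). The sketch assumes interval confidence regions for rewards and L1-ball confidence regions for transition kernels (with any widening already added to the radius); it does not cap optimistic rewards, and all function and parameter names are hypothetical. It is meant as an illustration, not as the exact procedure of Algorithm 1.

```python
def optimistic_kernel(p_hat, radius, u):
    """Maximise sum_j p[j] * u[j] over the L1 ball of `radius` around p_hat.

    p_hat is assumed to be a probability distribution.
    """
    order = sorted(range(len(u)), key=lambda j: -u[j])  # best state first
    p = list(p_hat)
    best = order[0]
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    for j in reversed(order):  # remove mass from the worst states first
        excess = sum(p) - 1.0
        if excess <= 1e-12:
            break
        p[j] = max(0.0, p[j] - excess)
    return p

def extended_value_iteration(r_hat, r_rad, p_hat, p_rad, eps=1e-6, max_iter=100_000):
    """r_hat[s][a], r_rad[s][a]: centre and half-width of the reward interval.
    p_hat[s][a]: empirical next-state distribution; p_rad[s][a]: L1 radius.
    Returns (policy, gain, u), where u plays the role of the bias vector.
    """
    n = len(r_hat)
    u = [0.0] * n
    for _ in range(max_iter):
        q = [[r_hat[s][a] + r_rad[s][a]
              + sum(pj * uj for pj, uj in
                    zip(optimistic_kernel(p_hat[s][a], p_rad[s][a], u), u))
              for a in range(len(r_hat[s]))] for s in range(n)]
        new_u = [max(row) for row in q]
        diff = [x - y for x, y in zip(new_u, u)]
        if max(diff) - min(diff) < eps:  # span-based stopping rule
            policy = [row.index(max(row)) for row in q]
            gain = 0.5 * (max(diff) + min(diff))
            return policy, gain, new_u
        u = new_u
    raise RuntimeError("EVI did not converge within max_iter iterations")
```

Enlarging the transition radius can only increase the returned gain, since the optimisation then ranges over a larger set of kernels; this is the sense in which the output is optimistic.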
4.3 Performance Analysis
We are now ready to analyze the performance of the SWUCRL2CW algorithm. As we are under a changing environment, the dynamic regret bound of SWUCRL2CW algorithm shall adapt to the underlying shifts (Besbes et al. 2014). We thus define the following variation measures for the drifts in rewards and transition kernels across time:
(6)  
(7) 
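For a fully specified sequence of rewards and kernels, variation measures of this kind can be computed directly. The sketch below assumes one common convention: per-round variations are the maximum over state-action pairs of the absolute reward change and of the L1 distance between consecutive kernels, and the budgets are their sums over rounds. The function name and data layout are hypothetical.

```python
def variation_budgets(mean_rewards, kernels):
    """mean_rewards[t][s][a] and kernels[t][s][a][s2] for t = 0, ..., T-1.

    Returns (per-round reward variations, per-round kernel variations,
    total reward variation, total kernel variation).
    """
    b_r, b_p = [], []
    for t in range(len(mean_rewards) - 1):
        pairs = [(s, a) for s in range(len(mean_rewards[t]))
                 for a in range(len(mean_rewards[t][s]))]
        # Largest change in mean reward between rounds t and t + 1.
        b_r.append(max(abs(mean_rewards[t + 1][s][a] - mean_rewards[t][s][a])
                       for s, a in pairs))
        # Largest L1 change in the next-state distribution between rounds.
        b_p.append(max(sum(abs(x - y) for x, y in
                           zip(kernels[t + 1][s][a], kernels[t][s][a]))
                       for s, a in pairs))
    return b_r, b_p, sum(b_r), sum(b_p)
```

A stationary instance gives zero for every entry, whereas an instance that changes at every round by a constant amount accumulates a total that grows linearly in the horizon.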
One can interpret the quantities as an upper bound on the variations in rewards and transition kernels between round and , and the quantities are the total allowable variations for rewards and transition kernels in rounds. (Clearly, we cannot hope to achieve a dynamic regret sublinear in if or is , so we focus on the case when .) Similar to (but less restrictive than) Yu and Mannor (2009), we further make the following assumption, which requires that the transition kernel changes slowly at a steady rate across time.
Assumption. The transition kernel’s variation budgets are uniform over time, i.e.,
Together with the confidence widening technique, we are guaranteed that the resulting MDP of the EVI has a bounded diameter. For ease of exposition, we also define the piecewise variations for each time that belongs to episode
To proceed, we introduce two events which state that the estimated reward and transition kernels lie in the confidence region.
We prove that hold with high probability.
Lemma. We have , . The proof is in Section B of the appendix. We then bound the dynamic regret of each round.
Proposition. Conditioning on events and assuming that , for every episode and every time in the episode (i.e., ) the following inequality holds with certainty:
Proof Sketch. The complete proof can be found in Section C of the appendix. Consider time , which belongs to episode . At a high level, the proof goes through three steps:

The estimated mean reward (used in computing ) and the true reward differ by at most

Finally, we relate the optimistic transition kernel and the underlying kernel to account for the extra loss due to a widened confidence region.
Combining all three steps, we can conclude the statement. ∎
Denote , and use the notation to hide logarithmic dependence on and on the confidence parameter that defines the confidence regions . Our first main result is a dynamic regret upper bound for the SWUCRL2CW algorithm.
Theorem. Assuming the SWUCRL2CW algorithm with window size and satisfies the dynamic regret bound
(8) 
with probability . If we further put
(9)  
(10) 
this is
The complete proof of Theorem 4.3 is presented in Section D of the appendix.
Remark. When , our problem model specializes to the nonstationary bandit problem studied by Besbes et al. (2014). In this case, we have , and we are left with the first term in (8). By choosing , our algorithm has dynamic regret , matching the minimax optimal dynamic regret bound by Besbes et al. (2014).
Remark. Different from (Cheung et al. 2019), there is no straightforward way of setting to get a nontrivial bound when are not known. While Cheung et al. (2019) provide a way to set their window size (oblivious of their variation budget ) so that the dynamic regret is in their linear bandit setting, in our setting we still need to have , which is a priori not clear how to ensure when we do not know .
4.4 Uniform Variation Budget, Confidence Widening, and Alternatives
We now pause to comment on Assumption 4.3 and the technique of confidence widening.
In online stochastic environments, one usually takes the time average of observed samples to estimate a certain latent quantity, even when the sample distributions vary with time. This has been proved to work well in the nonstationary bandit settings (Garivier and Moulines 2011, Cheung et al. 2019). For online MDPs, one typically looks at the time average MDP in (3), which estimates in (4) to within an additive error for any pair of . In the case of stationary MDPs where , one has and thus can conclude that the unwidened confidence region contains with high probability. An immediate consequence is that the EVI w.r.t. would return a policy with a bounded bias vector, as has diameter (see Section 4.3 of (Jaksch et al. 2010)). These further ensure that the optimistic long term average reward is not far away from the true long term average reward, e.g., step 2) in the proof of Proposition 4.3.
Nevertheless, this is not always the case under nonstationarity. Although Jaksch et al. (2010) and Gajane et al. (2018) use for the piecewise-stationary MDP setting, they crucially exploit the fact that the MDP remains unchanged between jumps, and can treat the problem as if it is stationary. For changing environments, one can only guarantee that the transition kernel with high probability, but cannot be sure about the true ’s due to the drift. In Section E.1 of the appendix, we show that the diameter of can grow as and the EVI w.r.t. can only promise a policy with bias vector bounded by which makes the dynamic regret bound vacuous for drifting environments. By Assumption 4.3 and the confidence widening technique, we are guaranteed that for each episode and can proceed as we have done.
Alternatively, if there exists such that for any states there is always an action with then the MDP has diameter and it can be shown that the SWUCRL2CW algorithm with enjoys a dynamic regret bound by similar techniques. As we shall show in Section E.2, this assumption can be easily satisfied in many realistic applications.
5 Bandit-over-Reinforcement Learning: Towards Parameter-Free
As pointed out by Remark 4.3, in the case of unknown and the DM cannot implement the SWUCRL2CW algorithm as the magnitude of confidence widening cannot be determined. To handle this case, we wish to design an online algorithm that can attain a reasonable dynamic regret bound in a parameter-free manner. By Theorem 4.3, we are assured that, under Assumption 4.3, a fixed pair of parameters can ensure low regret. For the bandit setting, Cheung et al. (2019) propose the bandit-over-bandit framework that uses a separate copy of the EXP3 algorithm to tune the window length. Inspired by it, we develop a novel Bandit-over-Reinforcement Learning (BORL) algorithm with parameter-free dynamic regret here.
5.1 Design Overview
Following a similar line of reasoning as Cheung et al. (2019), we make use of the SWUCRL2CW algorithm as a subroutine, and “hedge” (Bubeck and Cesa-Bianchi 2012) against the (possibly adversarial) changes of ’s and ’s to identify a reasonable fixed window length and confidence widening parameter.
As illustrated in Fig. 1, the BORL algorithm divides the whole time horizon into blocks of equal length rounds (the last block can be shorter), and specifies a set from which each pair of (window length, confidence widening) parameters is drawn. For each block , the BORL algorithm first calls some master algorithm to select a pair of (window length, confidence widening) parameters , and restarts the SWUCRL2CW algorithm with the selected parameters as a subroutine to choose actions for this block. Afterwards, the total reward of block is fed back to the master, and the “posterior” of these parameters is updated accordingly.
One immediate challenge not present in the bandit setting (Cheung et al. 2019) is that the starting state of each block is determined by previous moves of the DM. Hence, the master algorithm does not face a simple oblivious environment, as it does in the bandit setting where there is only one state. Fortunately, the state is observed before each block starts. To this end, we use the EXP3.P algorithm for multi-armed bandits against an adaptive adversary (Auer et al. 2002, Bubeck and Cesa-Bianchi 2012) as the master algorithm.
5.2 Design Details
We are now ready to state the details of the BORL algorithm. For some fixed choice of block length (to be determined later), we first define a couple of additional notations:
(11)  
Here, and are all possible choices of window length and confidence widening parameter, respectively, and is the Cartesian product of them with We emphasize that due to the restarting, any instance of the SWUCRL2CW algorithm cannot last for more than rounds. Consequently, even if the EXP3.P selects a window length the effective window length is and we make We also let be the total rewards for running the SWUCRL2CW algorithm with window length and confidence widening parameter for rounds starting from state
The EXP3.P algorithm (Bubeck and Cesa-Bianchi 2012) treats each element of as an arm. It begins by initializing
(12) 
where At the beginning of each block the BORL algorithm first sees the state and computes
(13) 
Then it sets with probability The selected pair of parameters are thus and Afterwards, the BORL algorithm starts from state selects actions by running the SWUCRL2CW algorithm with window length and confidence widening parameter for each round in block At the end of the block, the BORL algorithm observes the total rewards As a last step, it rescales by dividing it by so that it is within and updates
(14) 
The formal description of the BORL algorithm (with defined in the next subsection) is shown in Algorithm 2.
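The master's select-and-update cycle can be sketched with a textbook form of EXP3.P (Bubeck and Cesa-Bianchi 2012). The parameter names `eta`, `gamma` and `beta`, the class name, and the seeding are assumptions made for illustration; the tuned values used by the BORL algorithm are not reproduced here.

```python
import math
import random

class Exp3P:
    """EXP3.P-style master over n_arms arms with rewards in [0, 1]."""

    def __init__(self, n_arms, eta, gamma, beta, seed=0):
        self.n, self.eta, self.gamma, self.beta = n_arms, eta, gamma, beta
        self.scores = [0.0] * n_arms  # cumulative biased reward estimates
        self.rng = random.Random(seed)

    def probabilities(self):
        top = max(self.scores)  # subtract the max for numerical stability
        w = [math.exp(self.eta * (s - top)) for s in self.scores]
        total = sum(w)
        # Mix exponential weights with uniform exploration.
        return [(1.0 - self.gamma) * x / total + self.gamma / self.n for x in w]

    def select(self):
        return self.rng.choices(range(self.n), weights=self.probabilities(), k=1)[0]

    def update(self, arm, reward):
        """Feed back the rescaled reward of the arm that was played."""
        probs = self.probabilities()
        for i in range(self.n):
            gain = reward if i == arm else 0.0
            # Importance-weighted estimate with the EXP3.P bias term beta.
            self.scores[i] += (gain + self.beta) / probs[i]
```

In a BORL-style loop, each arm would stand for one (window length, confidence widening) pair: at the start of a block the master calls `select`, the subroutine runs with the chosen pair for the whole block, and the block's total reward, rescaled to lie in the unit interval, is passed to `update`.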
5.3 Performance Analysis
To analyze the performance of the BORL algorithm, we consider the following regret decomposition, for any choice of we have
(15) 
For the first term in eq. (15), we can apply the results from Theorem 4.3 to each block i.e.,
(16) 
where we have defined
for brevity. The second term captures the additional rewards the DM would collect were it to use the fixed parameters throughout w.r.t. the trajectory on the starting states of each block by the BORL algorithm, i.e., and this is exactly the regret of the EXP3.P algorithm when it is applied to a arm adaptive adversarial bandit problem with reward from Therefore, for any choice of we can upper bound this by
as Summing these two, the regret of the BORL algorithm is
(17) 
We now point out a key tradeoff in the choice of

On one hand, should be small enough so that the regret bound in eq. (17) is small.

On the other hand, should be large to allow to get close to even when is small.
To this end, we pick
We also justify the choice of and formalize the dynamic regret bound of the BORL algorithm as follows.
Theorem. Assume that the dynamic regret bound of the BORL algorithm is
with probability The complete proof can be found in Section F of the Appendix.
References
 Abbasi-Yadkori et al. (2013) Abbasi-Yadkori, Yasin, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvári. 2013. Online learning in markov decision processes with adversarially chosen transition probability distributions. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS).
 Agrawal and Jia (2017) Agrawal, Shipra, Randy Jia. 2017. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 1184–1194.
 Auer et al. (2002) Auer, P., N. Cesa-Bianchi, Y. Freund, R. Schapire. 2002. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 2002, Vol. 32, No. 1: pp. 48–77.
 Bartlett and Tewari (2009) Bartlett, Peter L., Ambuj Tewari. 2009. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating mdps. UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009. 35–42.
 Bertsekas (2017) Bertsekas, Dimitri. 2017. Dynamic Programming and Optimal Control. Athena Scientific.
 Besbes et al. (2014) Besbes, Omar, Yonatan Gur, Assaf Zeevi. 2014. Stochastic multi-armed bandit with nonstationary rewards. Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS).
 Bubeck and Cesa-Bianchi (2012) Bubeck, S., N. Cesa-Bianchi. 2012. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 2012, Vol. 5, No. 1: pp. 1–122.
 Chen et al. (2019) Chen, Yifang, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei. 2019. A new algorithm for nonstationary contextual bandits: Efficient, optimal, and parameter-free. Proceedings of Conference on Learning Theory (COLT).
 Cheung et al. (2019) Cheung, Wang Chi, David Simchi-Levi, Ruihao Zhu. 2019. Learning to optimize under nonstationarity. Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS).
 Dick et al. (2014) Dick, Travis, András György, Csaba Szepesvári. 2014. Online learning in markov decision processes with changing cost sequences. Proceedings of the International Conference on Machine Learning (ICML).
 Even-Dar et al. (2005) Even-Dar, Eyal, Sham M Kakade, Yishay Mansour. 2005. Experts in a markov decision process. Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS).
 Fruit et al. (2018a) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric. 2018a. Near optimal exploration-exploitation in non-communicating markov decision processes. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett, eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2998–3008.
 Fruit et al. (2018b) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner. 2018b. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. Jennifer Dy, Andreas Krause, eds., Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 1578–1586.
 Gajane et al. (2018) Gajane, Pratik, Ronald Ortner, Peter Auer. 2018. A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions. CoRR abs/1805.10066. URL http://arxiv.org/abs/1805.10066.
 Garivier and Moulines (2011) Garivier, A., E. Moulines. 2011. On upper-confidence bound policies for switching bandit problems. Proceedings of International Conference on Algorithmic Learning Theory (ALT).
 Hoeffding (1963) Hoeffding, Wassily. 1963. Probability inequalities for sums of bounded random variables. Journal of the American statistical association 58(301) 13–30.
 Jadbabaie et al. (2015) Jadbabaie, Ali, Alexander Rakhlin, Shahin Shahrampour, Karthik Sridharan. 2015. Online optimization: Competing with dynamic comparators. Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS).
 Jaksch et al. (2010) Jaksch, Thomas, Ronald Ortner, Peter Auer. 2010. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 11 1563–1600.
 Karnin and Anava (2016) Karnin, Z., O. Anava. 2016. Multi-armed bandits: Competing with optimal sequences. Proceedings of Annual Conference on Neural Information Processing Systems (NIPS).
 Keskin and Zeevi (2016) Keskin, N., A. Zeevi. 2016. Chasing demand: Learning and earning in a changing environment. Mathematics of Operations Research, 2016, 42(2), 277–307.
 Lattimore and Szepesvári (2018) Lattimore, T., C. Szepesvári. 2018. Bandit Algorithms. Cambridge University Press.
 Luo et al. (2018) Luo, H., C. Wei, A. Agarwal, J. Langford. 2018. Efficient contextual bandits in nonstationary worlds. Proceedings of Conference on Learning Theory (COLT).
 Nilim and Ghaoui (2005) Nilim, Arnab, Laurent El Ghaoui. 2005. Robust control of markov decision processes with uncertain transition matrices. Operations Research.
 Qin et al. (2019) Qin, Zhiwei (Tony), Jian Tang, Jieping Ye. 2019. Deep reinforcement learning with applications in transportation. Tutorial of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19).
 Weissman et al. (2003) Weissman, Tsachy, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, Marco L. Weinberger. 2003. Inequalities for the l1 deviation of the empirical distribution. Technical Report HPL-2003-97, HP Laboratories Palo Alto: www.hpl.hp.com/techreports/2003/HPL200397R1..
 Xu and Mannor (2006) Xu, Huan, Shie Mannor. 2006. The robustness-performance tradeoff in markov decision processes. Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS).
 Yu and Mannor (2009) Yu, Jia Yuan, Shie Mannor. 2009. Online learning in markov decision processes with arbitrarily changing rewards and transitions. Proceedings of the International Conference on Game Theory for Networks.
 Zhang and Wang (2018) Zhang, Anru, Mengdi Wang. 2018. Spectral state compression of markov processes. https://arxiv.org/abs/1802.02920.
Supplementary
Appendix A Supplementary Details about MDPs
The optimal long term reward is equal to the optimal value of the linear program . For a reward vector and a transition kernel , we define
(18)  
s.t.  
Throughout our analysis, it is useful to consider the following dual formulation of the optimization problem :
(19)  
s.t.  
The following lemma shows that any feasible solution to is essentially bounded if the underlying MDP is communicating, which will be crucial in the subsequent analysis.
Lemma. Let be a feasible solution to the dual problem , where constitute a communicating MDP with diameter . We have
The Lemma is extracted from Section 4.3.1 of (Jaksch et al. 2010), and it is more general than (Lattimore and Szepesvári 2018), which requires to be optimal instead of just feasible.