# Temporal Regularization in Markov Decision Process

###### Abstract

Several applications of Reinforcement Learning suffer from instability due to high variance. This is especially prevalent in high dimensional domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.


Pierre Thodoroff McGill University pierre.thodoroff@mail.mcgill.ca Audrey Durand McGill University audrey.durand@mcgill.ca Joelle Pineau McGill University & Facebook AI Research jpineau@cs.mcgill.ca Doina Precup McGill University dprecup@cs.mcgill.ca

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

## 1 Introduction

There has been much progress in Reinforcement Learning (RL) techniques, with some impressive success with games [30], and several interesting applications on the horizon [17, 29, 26, 9]. However RL methods are too often hampered by high variance, whether due to randomness in data collection, effects of initial conditions, complexity of learner function class, hyper-parameter configuration, or sparsity of the reward signal [15]. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some (smaller) bias. Regularization typically takes the form of smoothing over the observation space to reduce the complexity of the learner’s hypothesis class.

In the RL setting, we have an interesting opportunity to consider an alternative form of regularization, namely temporal regularization. Effectively, temporal regularization considers smoothing over the trajectory, whereby the estimate of the value function at one state is assumed to be related to the value function at the state(s) that typically occur before it in the trajectory. This structure arises naturally out of the fact that the value at each state is estimated using the Bellman equation. The standard Bellman equation clearly defines the dependency between value estimates. In temporal regularization, we amplify this dependency by making each state depend more strongly on estimates of previous states, as opposed to multi-step methods, which consider future states.

This paper proposes a class of temporally regularized value function estimates. We discuss properties of these estimates, based on notions from Markov chains, under the policy evaluation setting, and extend the notion to the control case. Our experiments show that temporal regularization effectively reduces variance and estimation error in discrete and continuous MDPs. The experiments also highlight that regularizing in the time domain rather than in the spatial domain allows more robustness to cases where state features are misspecified or noisy, as is the case in some Atari games.

## 2 Related work

Regularization in RL has been considered via several different perspectives. One line of investigation focuses on regularizing the features learned on the state space [11, 25, 24, 10, 21, 14]. In particular, backward bootstrapping methods can be seen as regularizing in feature space based on temporal proximity [34, 20, 1]. These approaches assume that nearby states in the state space have similar value. Other works focus on regularizing the changes in policy directly. Those approaches are often based on entropy methods [23, 28, 2]. Explicit regularization in the temporal space has received much less attention. Temporal regularization in some sense may be seen as a “backward” multi-step method [32]. The closest work to ours is possibly [36], which defines a natural value approximator by projecting the previous state's estimate, adjusting for the reward and the discount factor $\gamma$. Their formulation, while sharing similarity in motivation, leads to a different theory and algorithm. Convergence properties and bias induced by this class of methods were also not analyzed in Xu et al. [36].

## 3 Technical Background

### 3.1 Markov chains

We begin by introducing discrete Markov chain concepts that will be used to study the properties of temporally regularized MDPs. A discrete-time Markov chain [19] is defined by a discrete set of states $\mathcal{S}$ and a transition function $P : \mathcal{S} \times \mathcal{S} \mapsto [0, 1]$, which can also be written in matrix form as $P_{ij} = P(j \mid i)$, the probability of moving from state $i$ to state $j$. Throughout the paper, we make the following mild assumption on the Markov chain:

###### Assumption 1.

The Markov chain $P$ is ergodic: $P$ has a unique stationary distribution $\mu$.

In Markov chain theory, one of the main challenges is to study the mixing time of the chain [19]. Several results have been obtained when the chain is reversible, that is, when it satisfies detailed balance.

###### Definition 1 (Detailed balance [16]).

Let $P$ be an irreducible Markov chain with invariant stationary distribution $\mu$, where $\mu_i$ denotes the $i$-th element of $\mu$. A chain is said to satisfy detailed balance if and only if

$$
\mu_i P_{ij} = \mu_j P_{ji} \quad \forall i, j \in \mathcal{S}. \tag{1}
$$

Intuitively this means that if we start the chain in a stationary distribution, the amount of probability that flows from $i$ to $j$ is equal to the amount that flows from $j$ to $i$. In other words, the system must be at equilibrium. An intuitive example of a physical system not satisfying detailed balance is a snowflake in a cup of coffee. Indeed, many chains do not satisfy this detailed balance property. In this case it is possible to use a different, but related, chain called the reversal Markov chain to infer mixing time bounds [7].

###### Definition 2 (Reversal Markov chain [16]).

Let $\tilde{P}$, the reversal Markov chain of $P$, be defined as:

$$
\tilde{P}_{ij} = \frac{\mu_j P_{ji}}{\mu_i}. \tag{2}
$$

If $P$ is irreducible with invariant distribution $\mu$, then $\tilde{P}$ is also irreducible with invariant distribution $\mu$.

The reversal Markov chain $\tilde{P}$ can be interpreted as the Markov chain $P$ with time running backwards. If the chain is reversible, then $P = \tilde{P}$.
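To make the two definitions above concrete, the following is a minimal sketch in plain Python. It assumes a row-stochastic transition matrix stored as a list of lists, and it approximates the stationary distribution by power iteration. The helper names (`stationary`, `reversal`, `detailed_balance`) and the two example chains are hypothetical and are not taken from the paper.

```python
def stationary(P, iters=2000):
    """Approximate the row vector mu with mu P = mu by power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def reversal(P, mu):
    """Reversal chain: P_rev[i][j] = mu[j] * P[j][i] / mu[i]."""
    n = len(P)
    return [[mu[j] * P[j][i] / mu[i] for j in range(n)] for i in range(n)]


def detailed_balance(P, mu, tol=1e-9):
    """True if mu[i] * P[i][j] matches mu[j] * P[j][i] for every pair (i, j)."""
    n = len(P)
    return all(abs(mu[i] * P[i][j] - mu[j] * P[j][i]) < tol
               for i in range(n) for j in range(n))


# Hypothetical example chains (illustration only).
P_walk = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]  # lazy random walk
P_cycle = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]   # biased cycle

for name, P in [("walk", P_walk), ("cycle", P_cycle)]:
    mu = stationary(P)
    P_rev = reversal(P, mu)
    print(name, "satisfies detailed balance:", detailed_balance(P, mu))
    print(name, "reversal:", [[round(x, 3) for x in row] for row in P_rev])
```

In this sketch the random walk is reversible, so its reversal coincides with the chain itself, whereas the biased cycle is not, and its reversal runs the cycle in the opposite direction.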

### 3.2 Markov Decision Process

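A Markov Decision Process (MDP), as defined in [27], consists of a discrete set of states $\mathcal{S}$, a transition function $\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$, and a reward function $r : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$. On each round $t$, the learner observes current state $s_t \in \mathcal{S}$ and selects action $a_t \in \mathcal{A}$, after which it receives reward $r_t = r(s_t, a_t)$ and moves to new state $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$. We define a stationary policy $\pi$ as a probability distribution over actions conditioned on states, $\pi : \mathcal{S} \times \mathcal{A} \mapsto [0, 1]$.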

#### 3.2.1 Discounted Markov Decision Process

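When performing policy evaluation in the discounted case, the goal is to estimate the discounted expected return of policy $\pi$ at a state $s \in \mathcal{S}$, $v^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$, with discount factor $\gamma \in [0, 1)$. This can be obtained as the fixed point of the Bellman operator $\mathcal{T}^\pi$ such that:

$$
\mathcal{T}^\pi v^\pi = r + \gamma P^\pi v^\pi, \tag{3}
$$

where $P^\pi$ denotes the transition matrix under policy $\pi$, $v^\pi$ is the state values column-vector, and $r$ is the reward column-vector. The matrix $P^\pi$ also defines a Markov chain.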
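As an illustration of the fixed-point view of policy evaluation, the sketch below repeatedly applies the Bellman operator to a small, hypothetical two-state chain until the update becomes negligible. The function names, the chain, the reward vector and the stopping tolerance are all assumptions chosen for the example.

```python
def bellman(v, P, r, gamma):
    """Apply T v = r + gamma * P v for a fixed policy."""
    n = len(P)
    return [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]


def evaluate(P, r, gamma, tol=1e-12, max_iters=100000):
    """Iterate the Bellman operator from v = 0 until the update is below tol."""
    v = [0.0] * len(P)
    for _ in range(max_iters):
        v_next = bellman(v, P, r, gamma)
        if max(abs(a - b) for a, b in zip(v, v_next)) < tol:
            return v_next
        v = v_next
    return v


# Hypothetical 2-state chain induced by some fixed policy.
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 0.0]
gamma = 0.9

v = evaluate(P, r, gamma)
residual = max(abs(a - b) for a, b in zip(v, bellman(v, P, r, gamma)))
print("values:", [round(x, 4) for x in v], "Bellman residual:", residual)
```

Because the operator is a contraction with modulus equal to the discount factor, the number of iterations needed grows as the discount factor approaches one.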

In the control case, the goal is to find the optimal policy $\pi^*$ that maximizes the discounted expected return. Under the optimal policy, the optimal value function $v^*$ is the fixed point of the non-linear optimal Bellman operator:

$$
\mathcal{T}^* v^* = \max_{a \in \mathcal{A}} \left[ r(a) + \gamma P(a) v^* \right]. \tag{4}
$$

## 4 Temporal regularization

Regularization in the feature/state space, or spatial regularization as we call it, exploits the regularities that exist in the observation (or state). In contrast, temporal regularization considers the temporal structure of the value estimates through a trajectory. Practically this is done by smoothing the value estimate of a state using estimates of states that occurred earlier in the trajectory. In this section we first introduce the concept of temporal regularization and discuss its properties in the policy evaluation setting. We then show how this concept can be extended to exploit information from the entire trajectory by casting temporal regularization as a time series prediction problem.

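Let us focus on the simplest case where the value estimate at the current state is regularized using only the value estimate at the previous state in the trajectory, yielding updates of the form:

$$
v_\beta(s_t) = \mathbb{E}_{s_{t+1}, s_{t-1} \sim \pi}\left[ r(s_t) + \gamma\big( (1-\beta)\, v_\beta(s_{t+1}) + \beta\, v_\beta(s_{t-1}) \big) \right], \tag{5}
$$

for a parameter $\beta \in [0, 1]$ and $p(s_{t+1} \mid s_t)$ the transition probability induced by the policy $\pi$. It can be rewritten in matrix form as $v_\beta = r + \gamma\big((1-\beta) P^\pi + \beta \widetilde{P^\pi}\big) v_\beta$, where $\widetilde{P^\pi}$ corresponds to the reversal Markov chain of the MDP. We define a temporally regularized Bellman operator as:

$$
\mathcal{T}^\pi_\beta v_\beta = r + \gamma\big( (1-\beta) P^\pi v_\beta + \beta \widetilde{P^\pi} v_\beta \big). \tag{6}
$$

To alleviate the notation, we denote $P^\pi$ as $P$ and $\widetilde{P^\pi}$ as $\tilde{P}$.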

###### Remark.

For $\beta = 0$, Eq. 6 corresponds to the original Bellman operator.
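The temporally regularized operator can be sketched numerically by mixing a transition matrix with its reversal and then running ordinary policy evaluation on the mixture. The code below is an illustration under stated assumptions: `beta` stands for the regularization parameter, the stationary distribution is approximated by power iteration, and the three-state chain and reward vector are hypothetical.

```python
def stationary(P, iters=2000):
    """Approximate the row vector mu with mu P = mu by power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def reversal(P, mu):
    """Reversal chain: P_rev[i][j] = mu[j] * P[j][i] / mu[i]."""
    n = len(P)
    return [[mu[j] * P[j][i] / mu[i] for j in range(n)] for i in range(n)]


def mix(P, P_rev, beta):
    """Convex combination (1 - beta) * P + beta * P_rev."""
    n = len(P)
    return [[(1 - beta) * P[i][j] + beta * P_rev[i][j] for j in range(n)]
            for i in range(n)]


def evaluate(P, r, gamma, iters=2000):
    """Fixed point of v = r + gamma * P v, obtained by repeated application."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


# Hypothetical non-reversible chain: a biased cycle over three states.
P = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
r = [1.0, 0.0, 0.0]
gamma = 0.9
P_rev = reversal(P, stationary(P))

for beta in (0.0, 0.25, 0.5):
    v_beta = evaluate(mix(P, P_rev, beta), r, gamma)
    print("beta =", beta, "values:", [round(x, 4) for x in v_beta])
```

With `beta = 0.0` the mixture is the original matrix, so the sketch returns the unregularized values; larger values pull the estimates toward those of the reversed chain.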

We can prove that this operator has the following property.

###### Theorem 1.

The operator $\mathcal{T}^\pi_\beta$ has a unique fixed point $v^\pi_\beta$ and $\mathcal{T}^\pi_\beta$ is a contraction mapping.

###### Proof.

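We first prove that $\mathcal{T}^\pi_\beta$ is a contraction mapping in $L_\infty$ norm. We have that

$$
\begin{aligned}
\left\| \mathcal{T}^\pi_\beta u - \mathcal{T}^\pi_\beta v \right\|_\infty
&= \left\| r + \gamma\big((1-\beta) P u + \beta \tilde{P} u\big) - \Big( r + \gamma\big((1-\beta) P v + \beta \tilde{P} v\big) \Big) \right\|_\infty \\
&= \gamma \left\| \big((1-\beta) P + \beta \tilde{P}\big)(u - v) \right\|_\infty \\
&\le \gamma \left\| u - v \right\|_\infty,
\end{aligned} \tag{7}
$$

where the last inequality uses the fact that the convex combination of two row stochastic matrices is also row stochastic (the proof can be found in the appendix). Then, using the Banach fixed point theorem, we obtain that $v^\pi_\beta$ is a unique fixed point. ∎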
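The last inequality in the proof relies on the standard fact that a row stochastic matrix cannot increase the max-norm of a vector. A short derivation, writing $M$ for an arbitrary row stochastic matrix and $x$ for an arbitrary vector (both symbols are introduced here only for illustration), could read:

```latex
\|M x\|_\infty
  = \max_i \Big| \sum_j M_{ij} x_j \Big|
  \le \max_i \sum_j M_{ij} \, |x_j|
  \le \|x\|_\infty \, \max_i \sum_j M_{ij}
  = \|x\|_\infty .
```

Taking $M$ to be the convex combination of the transition matrix and its reversal, and $x$ to be the difference between the two value vectors, gives the contraction with modulus equal to the discount factor.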

Furthermore the new induced Markov chain has the same stationary distribution as the original (the proof can be found in the appendix).

###### Lemma 1.

$P$ and $(1-\beta)P + \beta\tilde{P}$ have the same stationary distribution $\mu$ for all $\beta \in [0, 1]$.

In the policy evaluation setting, the bias between the original value function $v^\pi$ and the regularized one $v^\pi_\beta$ can be characterized as a function of the difference between $P$ and its Markov reversal $\tilde{P}$, weighted by $\beta$ and the reward distribution.

###### Proposition 1.

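Let $v^\pi = \sum_{i=0}^{\infty} \gamma^i P^i r$ and $v^\pi_\beta = \sum_{i=0}^{\infty} \gamma^i \big((1-\beta)P + \beta\tilde{P}\big)^i r$. We have that

$$
\left\| v^\pi - v^\pi_\beta \right\|_\infty
= \left\| \sum_{i=0}^{\infty} \gamma^i \Big( P^i - \big((1-\beta)P + \beta\tilde{P}\big)^i \Big) r \right\|_\infty
\le \sum_{i=0}^{\infty} \gamma^i \left\| \Big( P^i - \big((1-\beta)P + \beta\tilde{P}\big)^i \Big) r \right\|_\infty. \tag{8}
$$

This quantity is naturally bounded for $\gamma < 1$.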

###### Remark.

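Let $P^\infty$ denote a matrix where columns consist of the stationary distribution $\mu$. By the property of reversal Markov chains and Lemma 1, we have that $\lim_{i \to \infty} \| P^i r - P^\infty r \| = 0$ and $\lim_{i \to \infty} \| ((1-\beta)P + \beta\tilde{P})^i r - P^\infty r \| = 0$, such that the Markov chain and its reversal converge to the same value. Therefore, the norm $\| (P^i - ((1-\beta)P + \beta\tilde{P})^i) r \|_p$ also converges to 0 in the limit.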

###### Remark.

It is interesting to note that if the chain is reversible, meaning that $P = \tilde{P}$, then the fixed point of both operators is the same, that is $v^\pi = v^\pi_\beta$.
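The dependence of the bias on reversibility can be checked numerically. The following sketch, under the assumption of a row-stochastic matrix and a power-iteration estimate of the stationary distribution, computes the max-norm gap between the regularized and unregularized value functions for one reversible and one non-reversible chain. Both chains, the reward vector and the function names are hypothetical.

```python
def stationary(P, iters=2000):
    """Approximate the row vector mu with mu P = mu by power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def reversal(P, mu):
    """Reversal chain: P_rev[i][j] = mu[j] * P[j][i] / mu[i]."""
    n = len(P)
    return [[mu[j] * P[j][i] / mu[i] for j in range(n)] for i in range(n)]


def evaluate(P, r, gamma, iters=2000):
    """Fixed point of v = r + gamma * P v, obtained by repeated application."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


def bias(P, r, gamma, beta):
    """Max-norm gap between the original and the regularized value function."""
    n = len(P)
    P_rev = reversal(P, stationary(P))
    P_beta = [[(1 - beta) * P[i][j] + beta * P_rev[i][j] for j in range(n)]
              for i in range(n)]
    v, v_beta = evaluate(P, r, gamma), evaluate(P_beta, r, gamma)
    return max(abs(a - b) for a, b in zip(v, v_beta))


# Hypothetical chains (illustration only).
P_walk = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]  # reversible
P_cycle = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]   # non-reversible
r = [1.0, 0.0, 0.0]

print("reversible chain bias:    ", bias(P_walk, r, gamma=0.9, beta=0.5))
print("non-reversible chain bias:", bias(P_cycle, r, gamma=0.9, beta=0.5))
```

In this example the reversible chain shows a gap at the level of floating-point error, while the biased cycle shows a clearly non-zero gap.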

##### Discounted average reward case:

The temporally regularized MDP has the same discounted average reward as the original one, as it is possible to define the discounted average reward [35] as a function of the stationary distribution $\mu$, the reward vector and $\gamma$. This leads to the following property (the proof can be found in the appendix).

###### Proposition 2.

For a reward vector $r$, the MDPs defined by the transition matrices $P$ and $(1-\beta)P + \beta\tilde{P}$ have the same average reward $\rho$.

Intuitively, this means that temporal regularization only reweighs the reward on each state based on the Markov reversal, while preserving the average reward.
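Lemma 1 and Proposition 2 can be illustrated with a small numerical check. The sketch below mixes a hypothetical three-state chain with its reversal for several values of the regularization parameter (called `beta` here) and compares the stationary distribution and the stationary average of a hypothetical reward vector. It assumes a row-stochastic matrix and uses power iteration, neither of which is prescribed by the paper.

```python
def stationary(P, iters=2000):
    """Approximate the row vector mu with mu P = mu by power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def reversal(P, mu):
    """Reversal chain: P_rev[i][j] = mu[j] * P[j][i] / mu[i]."""
    n = len(P)
    return [[mu[j] * P[j][i] / mu[i] for j in range(n)] for i in range(n)]


def mix(P, P_rev, beta):
    """Convex combination (1 - beta) * P + beta * P_rev."""
    n = len(P)
    return [[(1 - beta) * P[i][j] + beta * P_rev[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical chain and rewards (illustration only).
P = [[0.5, 0.5, 0.0], [0.1, 0.1, 0.8], [0.6, 0.2, 0.2]]
r = [1.0, 0.0, 2.0]
mu = stationary(P)
P_rev = reversal(P, mu)

for beta in (0.0, 0.3, 0.7, 1.0):
    mu_beta = stationary(mix(P, P_rev, beta))
    gap = max(abs(a - b) for a, b in zip(mu, mu_beta))
    avg_reward = sum(m * x for m, x in zip(mu_beta, r))
    print("beta =", beta, "distribution gap:", gap, "average reward:", avg_reward)
```

The stationary average of the reward is computed here without the discount-dependent scaling, which is a constant factor shared by both chains.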

##### Temporal Regularization as a time series prediction problem:

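It is possible to cast this problem of temporal regularization as a time series prediction problem, and use richer models of temporal dependencies, such as exponential smoothing [12], ARMA models [5], etc. We can write the update in a general form using $n$ different regularizers ($\tilde{v}_0, \tilde{v}_1, \dots, \tilde{v}_{n-1}$):

$$
v(s_t) = r(s_t) + \gamma \sum_{i=0}^{n-1} \beta(i)\, \tilde{v}_i(s_{t+1}), \tag{9}
$$

where $\tilde{v}_0(s_{t+1}) = v(s_{t+1})$ and $\sum_{i=0}^{n-1} \beta(i) = 1$. For example, using exponential smoothing where $\tilde{v}(s_{t+1}) = (1-\lambda)\, v(s_{t-1}) + (1-\lambda)\lambda\, v(s_{t-2}) + \dots$, the update can be written in operator form as:

$$
\mathcal{T}^\pi_\beta v = r + \gamma \Big( (1-\beta) P v + \beta (1-\lambda) \sum_{i=1}^{\infty} \lambda^{i-1} \tilde{P}^i v \Big), \tag{10}
$$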

and a similar argument as Theorem 1 can be used to show the contraction property. The bias of exponential smoothing in policy evaluation can be characterized as:

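$$
\left\| v^\pi - v^\pi_\beta \right\|_\infty
\le \sum_{i=0}^{\infty} \gamma^i \left\| \bigg( P^i - \Big( (1-\beta) P + \beta (1-\lambda) \sum_{j=1}^{\infty} \lambda^{j-1} \tilde{P}^j \Big)^i \bigg) r \right\|_\infty. \tag{11}
$$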

Using more powerful regularizers could be beneficial, for example to reduce variance by smoothing over more values (exponential smoothing) or to model the trend of the value function through the trajectory using a trend-adjusted model [13]. An example of policy evaluation with temporal regularization using exponential smoothing is provided in Algorithm 1.
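A tabular TD(0) learner with an exponentially smoothed value of past states might be sketched as follows. This is an illustration under stated assumptions rather than a transcription of the paper's algorithm: the parameter names (`beta` for the regularization weight, `lam` for the smoothing factor, `alpha` for the step size), the initialisation of the smoothed value and the example chain are all hypothetical choices.

```python
import random


def td_exponential_smoothing(P, r, gamma, beta, lam,
                             alpha=0.1, steps=20000, seed=0):
    """Tabular TD(0) whose target mixes v(next state) with a smoothed past value."""
    rng = random.Random(seed)
    n = len(P)
    v = [0.0] * n
    s = 0
    p = v[s]  # smoothed value of past states; assumption: initialised to v(s_0)
    for _ in range(steps):
        # Sample the next state from row s of the policy-induced chain.
        s_next = rng.choices(range(n), weights=P[s])[0]
        target = r[s] + gamma * ((1 - beta) * v[s_next] + beta * p)
        v[s] += alpha * (target - v[s])
        # Fold the freshly updated value of s into the exponential average.
        p = (1 - lam) * v[s] + lam * p
        s = s_next
    return v


# Hypothetical chain with a constant reward, so every state has value 1 / (1 - gamma).
P = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
r = [1.0, 1.0, 1.0]
v_hat = td_exponential_smoothing(P, r, gamma=0.9, beta=0.5, lam=0.5)
print([round(x, 4) for x in v_hat])
```

Setting `lam = 0` in this sketch reduces the smoothed value to the estimate at the immediately preceding state, and setting `beta = 0` recovers ordinary TD(0).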

##### Control case:

Temporal regularization can be extended to MDPs with actions by modifying the target of the value function (or the Q values) using temporal regularization. Experiments (Sec. 5.6) present an example of how temporal regularization can be applied within an actor-critic framework. The theoretical analysis of the control case is outside the scope of this paper.

##### Temporal difference with function approximation:

It is also possible to extend temporal regularization using function approximation such as semi-gradient TD [33]. Assuming a function $v_\theta$ parameterized by $\theta$, we can consider $r(s_t) + \gamma\big((1-\beta)\, v_\theta(s_{t+1}) + \beta\, v_\theta(s_{t-1})\big)$ as the target and differentiate with respect to $v_\theta(s_t)$. An example of a temporally regularized semi-gradient TD algorithm can be found in the appendix.
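With a linear value function, a temporally regularized semi-gradient TD update could look like the sketch below. It treats the regularized target as a constant and differentiates only the prediction at the current state, which is what makes the method semi-gradient. The feature matrix, step size, smoothing factor `lam` and example chain are assumptions made for illustration, and the sketch does not reproduce the appendix algorithm.

```python
import random


def semi_gradient_td(P, r, features, gamma, beta, lam,
                     alpha=0.1, steps=20000, seed=0):
    """Linear semi-gradient TD(0) with a temporally regularized target."""
    rng = random.Random(seed)
    dim = len(features[0])
    theta = [0.0] * dim

    def value(s):
        # Linear value estimate: inner product of theta and the features of s.
        return sum(w * x for w, x in zip(theta, features[s]))

    s = 0
    p = value(s)  # smoothed past value; assumption: initialised to the estimate at s_0
    for _ in range(steps):
        s_next = rng.choices(range(len(P)), weights=P[s])[0]
        # The target is held fixed: no gradient flows through it.
        target = r[s] + gamma * ((1 - beta) * value(s_next) + beta * p)
        td_error = target - value(s)
        for k in range(dim):
            theta[k] += alpha * td_error * features[s][k]
        p = (1 - lam) * value(s) + lam * p
        s = s_next
    return theta


# Hypothetical example: one-hot features make the linear model tabular.
P = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
r = [1.0, 1.0, 1.0]
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta = semi_gradient_td(P, r, one_hot, gamma=0.9, beta=0.5, lam=0.5)
print([round(w, 4) for w in theta])
```

The function accepts any per-state feature vectors of equal length; one-hot features are used in the example only because the expected result is then easy to check by hand.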

## 5 Experiment

We now present empirical results illustrating potential advantages of temporal regularization, and characterizing its bias and variance effects on value estimation and control.

### 5.1 Mixing time

This first experiment showcases that the underlying Markov chain of an MDP can have a smaller mixing time when temporally regularized. The mixing time can be seen as the number of time steps required for the Markov chain to get close enough to its stationary distribution. Therefore, the mixing time also determines the rate at which policy evaluation will converge to the optimal value function [3]. We consider a synthetic MDP with 10 states where transition probabilities are sampled from the uniform distribution. Let $P^\infty$ denote a matrix where columns consist of the stationary distribution $\mu$. To compare the mixing time, we evaluate the error corresponding to the distance of $P^i$ and $\big((1-\beta)P + \beta\tilde{P}\big)^i$ to the convergence point $P^\infty$ after $i$ iterations. Figure 2 displays the error curve when varying the regularization parameter $\beta$. We observe a U-shaped error curve: intermediate values of $\beta$ in this example yield a faster mixing time. One explanation is that transition matrices with extreme probabilities (low or high) are poorly conditioned. Regularizing with the reversal Markov chain often leads to a better conditioned matrix at the cost of injecting bias.
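An experiment of this kind can be sketched in a few lines. The code below is not the authors' script: the random seed, the number of steps, the choice of the max-row-sum norm as distance, and the convention that the limiting matrix has every row equal to the stationary distribution (which follows from treating the transition matrix as row stochastic) are all assumptions.

```python
import random


def matmul(A, B):
    """Product of two square matrices stored as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def stationary(P, iters=2000):
    """Approximate the row vector mu with mu P = mu by power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def reversal(P, mu):
    """Reversal chain: P_rev[i][j] = mu[j] * P[j][i] / mu[i]."""
    n = len(P)
    return [[mu[j] * P[j][i] / mu[i] for j in range(n)] for i in range(n)]


def mix(P, P_rev, beta):
    """Convex combination (1 - beta) * P + beta * P_rev."""
    n = len(P)
    return [[(1 - beta) * P[i][j] + beta * P_rev[i][j] for j in range(n)]
            for i in range(n)]


def mixing_error(P, mu, steps):
    """Max-row-sum distance between P^steps and the limit whose rows equal mu."""
    n = len(P)
    power = P
    for _ in range(steps - 1):
        power = matmul(power, P)
    return max(sum(abs(power[i][j] - mu[j]) for j in range(n)) for i in range(n))


# Random 10-state chain: uniform samples, each row normalised to sum to one.
rng = random.Random(0)
n = 10
raw = [[rng.random() for _ in range(n)] for _ in range(n)]
P = [[x / sum(row) for x in row] for row in raw]
mu = stationary(P)
P_rev = reversal(P, mu)

for beta in (0.0, 0.25, 0.5, 0.75, 1.0):
    print("beta =", beta, "error after 3 steps:",
          mixing_error(mix(P, P_rev, beta), mu, steps=3))
```

Whether a U-shaped curve appears depends on the sampled chain, so the printed values should be read as one instance rather than as a reproduction of the reported figure.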

### 5.2 Bias

It is well known that reducing variance comes at the expense of inducing (smaller) bias. This has been characterized previously (Sec. 4) in terms of the difference between the original Markov chain and the reversal weighted by the reward. In this experiment, we attempt to give an intuitive idea of what this means. More specifically, we would expect the bias to be small if values along the trajectories have similar values. To this end, we consider a synthetic MDP with states where both transition functions and rewards are sampled randomly from a uniform distribution. In order to create temporal dependencies in the trajectory, we smooth the rewards of states that are temporally close (in terms of trajectory) using the following formula: . Figure 2 shows the difference between the regularized and un-regularized MDPs as changes, for different values of regularization parameter . We observe that increasing , meaning more states get rewards close to one another, results into less bias. This is due to rewards putting emphasis on states where the original and reversal Markov chain are similar.

### 5.3 Variance

The primary motivation of this work is to reduce variance, therefore we now consider an experiment targeting this aspect. Figure 4 shows an example of a synthetic, 3-state MDP, where the variance of is (relatively) high. We consider an agent that is evolving in this world, changing states following the stochastic policy indicated. We are interested in the error when estimating the optimal state value of , , with and without temporal regularization, denoted , , respectively.

Figure 4 shows these errors at each iteration, averaged over runs. We observe that temporal regularization indeed reduces the variance and thus helps the learning process by making the value function easier to learn.

### 5.4 Propagation of the information

We now illustrate with a simple experiment how temporal regularization allows the information to spread faster among the different states of the MDP. For this purpose, we consider a simple MDP, where an agent walks randomly in two rooms (18 states) using four actions (up, down, left, right), and a discount factor . The reward is everywhere and passing the door between rooms (shown in red on Figure 5) only happens 50% of the time (on attempt). The episode starts at the top left and terminates when the agent reaches the bottom right corner. The sole goal is to learn the optimal value function by walking along this MDP (this is not a race toward the end).

Figure 5 shows the proximity of the estimated state value to the optimal value with and without temporal regularization. The darker the state, the closer it is to its optimal value. The heatmap scale has been adjusted at each trajectory to observe the difference between both methods. We first notice that the overall propagation of the information in the regularized MDP is faster than in the original one. We also observe that, when first entering the second room, bootstrapping on values coming from the first room allows the agent to learn the optimal value faster. This suggests that temporal regularization could help agents explore faster by using their prior from the previously visited state to learn the corresponding optimal value faster. It is also possible to consider more complex and powerful regularizers. Let us study a different time series prediction model, namely exponential averaging, as defined in (10). The complexity of such models is usually articulated by hyper-parameters, allowing complex models to improve performance by better adapting to problems. We illustrate this by comparing the performance of regularization using the previous state and an exponential averaging of all previous states. Fig. 6 shows the average error on the value estimate using past state smoothing, exponential smoothing, and without smoothing. In this setting, exponential smoothing transfers information faster, thus enabling faster convergence to the true value.

### 5.5 Noisy state representation

The next experiment illustrates a major strength of temporal regularization, that is its robustness to noise in the state representation. This situation can naturally arise when the state sensors are noisy or insufficient to avoid aliasing. For this task, we consider the synthetic, one dimensional, continuous setting. A learner evolving in this environment walks randomly along this line with a discount factor . Let denote the position of the agent along the line at time . The next position , where action . The state of the agent corresponds to the position perturbed by a zero-centered Gaussian noise , such that , where are i.i.d. When the agent moves to a new position , it receives a reward . The episode ends after 1000 steps. In this experiment we model the value function using a linear model with a single parameter . We are interested in the error when estimating the optimal parameter function with and without temporal regularization, that is and , respectively. In this case we use the TD version of temporal regularization presented at the end of Sec. 4. Figure 8 shows these errors, averaged over 1000 repetitions, for different values of noise variance . We observe that as the noise variance increases, the un-regularized estimate becomes less accurate, while temporal regularization is more robust. Using more complex regularizer can improve performance as shown in the previous section but this potential gain comes at the price of a potential loss in case of model misfit. Fig. 8 shows the absolute distance from the regularized state estimate (using exponential smoothing) to the optimal value while varying (higher = more smoothing). Increasing smoothing improves performance up to some point, but when is not well fit the bias becomes too strong and performance declines. This is a classic bias-variance tradeoff. 
This experiment highlights a case where temporal regularization is effective even in the absence of smoothness in the state space (which other regularization methods would target). This is further highlighted in the next experiments.

### 5.6 Deep reinforcement learning

To showcase the potential of temporal regularization in high dimensional settings, we adapt an actor-critic based method (PPO [28]) using temporal regularization. More specifically, we incorporate temporal regularization as exponential smoothing in the target of the critic. PPO uses the generalized advantage estimator, which is built from the temporal-difference residuals $\delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)$. We regularize $\delta_t$ such that $\delta^\beta_t = r_t + \gamma\big((1-\beta)\, v(s_{t+1}) + \beta\, \tilde{v}(s_{t-1})\big) - v(s_t)$ using exponential smoothing as described in Eq. (10). $\tilde{v}$ is an exponentially decaying sum over all previous state values encountered in the trajectory. We evaluate the performance in the Arcade Learning Environment [4], where we consider the following performance measure:

(12) |

The hyper-parameters for the temporal regularization are and a decay of . Those are selected on 7 games and 3 training seeds. All other hyper-parameters correspond to those used in the PPO paper. Our implementation (the code can be found at https://github.com/pierthodo/temporal_regularization) is based on the publicly available OpenAI codebase [8]. The previous four frames are considered as the state representation [22].
For each game, independent runs ( random seeds) are performed.
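The way temporal regularization enters the critic target can be sketched on a single recorded trajectory. The code below is an illustration only: it is not taken from the authors' implementation or from the OpenAI codebase, and the function names, the treatment of the first time step and the toy numbers are assumptions. Here `beta` is the regularization weight, `lam_smooth` the smoothing factor and `lam_gae` the advantage-estimator parameter.

```python
def regularized_deltas(rewards, values, gamma, beta, lam_smooth):
    """TD residuals whose bootstrap term mixes v(s_{t+1}) with a smoothed past value.

    `values` holds one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final reward.
    """
    deltas = []
    v_smooth = values[0]  # assumption: no earlier state exists at t = 0
    for t, reward in enumerate(rewards):
        target = reward + gamma * ((1 - beta) * values[t + 1] + beta * v_smooth)
        deltas.append(target - values[t])
        # Exponentially decaying average over the values seen so far.
        v_smooth = (1 - lam_smooth) * values[t] + lam_smooth * v_smooth
    return deltas


def advantages(deltas, gamma, lam_gae):
    """Discounted backward sum of residuals, as in generalized advantage estimation."""
    adv = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam_gae * running
        adv[t] = running
    return adv


# Toy trajectory (hypothetical numbers).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 0.0, 2.0]
plain = regularized_deltas(rewards, values, gamma=0.9, beta=0.0, lam_smooth=0.0)
smooth = regularized_deltas(rewards, values, gamma=0.9, beta=0.5, lam_smooth=0.0)
print("unregularized residuals:", plain)
print("regularized residuals:  ", smooth)
print("advantages:", advantages(smooth, gamma=0.9, lam_gae=0.95))
```

Setting `beta = 0.0` recovers the usual residuals, which makes it easy to compare a regularized run against an unregularized baseline with the same code path.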

The results reported in Figure 9 show that adding temporal regularization improves the performance on multiple games. This suggests that the regularized optimal value function may be smoother and thus easier to learn, even when using function approximation with deep learning. Also, as shown in previous experiments (Sec. 5.5), temporal regularization being independent of spatial representation makes it more robust to mis-specification of the state features, which is a challenge in some of these games (e.g. when assuming full state representation using some previous frames).

## 6 Discussion

##### Noisy states:

It is often assumed that the full state can be determined, while in practice, the Markov property rarely holds. This is the case, for example, when taking the four last frames to represent the state in Atari games [22]. A problem that arises when treating a partially observable MDP (POMDP) as fully observable is that it may no longer be possible to assume that the value function is smooth over the state space [31]. For example, the observed features may be similar for two states that are intrinsically different, leading to highly different values for states that are nearby in the state space. Previous experiments on noisy state representation (Sec. 5.5) and on the Atari games (Sec. 5.6) show that temporal regularization provides robustness to those cases. This makes it an appealing technique in real-world environments, where it is harder to provide the agent with the full state.

##### Choice of the regularization parameter:

The bias induced by the regularization parameter $\beta$ can be detrimental to learning in the long run. A first attempt to mitigate this bias is simply to decay the regularization as learning advances, as is done in the deep learning experiment (Sec. 5.6). Among the different avenues that could be explored, an interesting one could be to aim for a state-dependent regularization. For example, in the tabular case, one could consider $\beta$ as a function of the number of visits to a particular state.

##### Smoother objective:

Previous work [18] looked at how the smoothness of the objective function relates to the convergence speed of RL algorithms. An analogy can be drawn with convex optimization where the rate of convergence is dependent on the Lipschitz (smoothness) constant [6]. By smoothing the value temporally we argue that the optimal value function can be smoother. This would be beneficial in high-dimensional state spaces where the use of deep neural networks is required. This could explain the performance displayed using temporal regularization on Atari games (Sec. 5.6). The notion of temporal regularization is also behind multi-step methods [32]; it may be worthwhile to further explore how these methods are related.

##### Conclusion:

This paper tackles the problem of regularization in RL from a new angle, that is from a temporal perspective. In contrast with typical spatial regularization, where one assumes that rewards are close for nearby states in the state space, temporal regularization rather assumes that rewards are close for states visited closely in time. This approach allows information to propagate faster into states that are hard to reach, which could prove useful for exploration. The robustness of the proposed approach to noisy state representations and its interesting properties should motivate further work to explore novel ways of exploiting temporal information.

#### Acknowledgments

The authors wish to thank Pierre-Luc Bacon, Harsh Satija and Joshua Romoff for helpful discussions. Financial support was provided by NSERC and Facebook. This research was enabled by support provided by Compute Canada. We thank the reviewers for insightful comments and suggestions.

## References

- Baird [1995] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
- Bartlett and Tewari [2009] P. L. Bartlett and A. Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
- Baxter and Bartlett [2001] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
- Bellemare et al. [2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Box et al. [1994] G. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control (3rd ed.). Prentice-Hall, 1994.
- Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Chung et al. [2012] K.-M. Chung, H. Lam, Z. Liu, and M. Mitzenmacher. Chernoff-hoeffding bounds for markov chains: Generalized and simplified. arXiv preprint arXiv:1201.0559, 2012.
- Dhariwal et al. [2017] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. Openai baselines. https://github.com/openai/baselines, 2017.
- Dhingra et al. [2017] B. Dhingra, L. Li, X. Li, J. Gao, Y.-N. Chen, F. Ahmed, and L. Deng. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 484–495, 2017.
- Farahmand [2011] A.-m. Farahmand. Regularization in reinforcement learning. PhD thesis, University of Alberta, 2011.
- Farahmand et al. [2009] A.-m. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted q-iteration for planning in continuous-space markovian decision problems. In American Control Conference, pages 725–730. IEEE, 2009.
- Gardner [2006] E. S. Gardner. Exponential smoothing: The state of the art—part ii. International journal of forecasting, 22(4):637–666, 2006.
- Gardner Jr [1985] E. S. Gardner Jr. Exponential smoothing: The state of the art. Journal of forecasting, 4(1):1–28, 1985.
- Harrigan [2016] C. Harrigan. Deep reinforcement learning with regularized convolutional neural fitted q iteration. 2016.
- Henderson et al. [2018] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In AAAI, 2018.
- Kemeny and Snell [1976] J. G. Kemeny and J. L. Snell. Finite markov chains, undergraduate texts in mathematics. 1976.
- Koedinger et al. [2018] K. Koedinger, E. Brunskill, R. Baker, and E. McLaughlin. New potentials for data-driven intelligent tutoring system development and optimization. AAAI magazine, 2018.
- Laroche [2018] V. S. Laroche. In reinforcement learning, all objective functions are not equal. ICLR Workshop, 2018.
- Levin and Peres [2008] D. A. Levin and Y. Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2008.
- Li [2008] L. Li. A worst-case comparison between temporal difference and residual gradient with linear function approximation. In Proceedings of the 25th international conference on machine learning, pages 560–567. ACM, 2008.
- Liu et al. [2012] B. Liu, S. Mahadevan, and J. Liu. Regularized off-policy td-learning. In Advances in Neural Information Processing Systems, pages 836–844, 2012.
- Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Neu et al. [2017] G. Neu, A. Jonsson, and V. Gómez. A unified view of entropy-regularized markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
- Pazis and Parr [2011] J. Pazis and R. Parr. Non-parametric approximate linear programming for mdps. In AAAI, 2011.
- Petrik et al. [2010] M. Petrik, G. Taylor, R. Parr, and S. Zilberstein. Feature selection using regularization in approximate linear programs for markov decision processes. arXiv preprint arXiv:1005.1860, 2010.
- Prasad et al. [2017] N. Prasad, L. Cheng, C. Chivers, M. Draugelis, and B. Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. In UAI, 2017.
- Puterman [1994] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
- Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Shortreed et al. [2011] S. Shortreed, E. Laber, D. Lizotte, S. Stroup, J. Pineau, and S. Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 2011.
- Silver et al. [2016] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
- Singh et al. [1994] S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings 1994, pages 284–292. Elsevier, 1994.
- Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press Cambridge, 1st edition, 1998.
- Sutton and Barto [2017] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press Cambridge, (in progress) 2nd edition, 2017.
- Sutton et al. [2009] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
- Tsitsiklis and Van Roy [2002] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2-3):179–191, 2002.
- Xu et al. [2017] Z. Xu, J. Modayil, H. P. van Hasselt, A. Barreto, D. Silver, and T. Schaul. Natural value approximators: Learning when to trust past estimates. In Advances in Neural Information Processing Systems, pages 2117–2125, 2017.

## Appendix A Appendix

###### Lemma 1.

$P$ and $(1-\beta)P + \beta\tilde{P}$ have the same stationary distribution $\mu$ for all $\beta \in [0, 1]$.

###### Proof.

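It is known that $P$ and $\tilde{P}$ have the same stationary distribution $\mu$. Using this fact we have that

$$
\mu^\top \big( (1-\beta) P + \beta \tilde{P} \big)
= (1-\beta)\, \mu^\top P + \beta\, \mu^\top \tilde{P}
= (1-\beta)\, \mu^\top + \beta\, \mu^\top
= \mu^\top. \tag{13}
$$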

∎

###### Property 2.

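For a reward vector $r$, the MDPs defined by the transition matrices $P$ and $(1-\beta)P + \beta\tilde{P}$ have the same discounted average reward $\rho$:

$$
\rho = \sum_{i=0}^{\infty} \gamma^i \mu^\top r = \frac{1}{1-\gamma}\, \mu^\top r. \tag{14}
$$

###### Proof.

Using Lemma 1, both $P$ and $(1-\beta)P + \beta\tilde{P}$ have the same stationary distribution and so the same discounted average reward. ∎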

###### Lemma 2.

The convex combination of two row stochastic matrices is also row stochastic.

###### Proof.

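Let $e$ be a column vector of ones. For two row stochastic matrices $P$ and $\tilde{P}$ and $\beta \in [0, 1]$:

$$
\big( (1-\beta) P + \beta \tilde{P} \big) e
= (1-\beta)\, P e + \beta\, \tilde{P} e
= (1-\beta)\, e + \beta\, e
= e. \tag{15}
$$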

∎