Distribution Estimation in Discounted MDPs via a Transformation
Abstract
Although the general deterministic reward function in MDPs takes three arguments (current state, action, and next state), it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective is a function of the expected cumulative reward only, this simplification works perfectly. However, when the objective is risk-sensitive (e.g., depends on the reward distribution), this simplification leads to incorrect values of the objective. This paper studies the distribution estimation of the cumulative discounted reward in infinite-horizon MDPs with finite state and action spaces. First, taking the Value-at-Risk (VaR) objective as an example, we illustrate and analyze the error that the above simplification introduces in the reward distribution. Next, we propose a transformation for MDPs that preserves the reward distribution and converts transition-based reward functions to deterministic state-based reward functions. This transformation works whether the transition-based reward function is deterministic or stochastic. Lastly, we show how to estimate the reward distribution after applying the proposed transformation in different settings, provided that the distribution is approximately normal.
1 Introduction
In general reinforcement learning (RL) settings, two important functions are derived from the reward function: the value function on the state space and the Q-function on the state-action space. Both play important roles in RL since they represent expected overall values. However, the reward function often has a more complicated form: it can be transition-based (defined on the state-action-state space), stochastic, or both, and simplifying it usually leads to a different reward distribution, because the simplification keeps only the first-moment information of that distribution. This paper addresses risk-sensitive problems with techniques that require the reward to be deterministic and to depend only on the current state (and action), while keeping the distribution intact.
We focus on estimating the distribution of the return (in this paper the discounted total reward, although the word is not limited to that meaning) in an infinite-horizon MDP with finite state and action spaces, and consider the Value-at-Risk (VaR) objective as a risk-sensitive example. We illustrate the loss of higher-moment information that the reward simplification causes in the return distribution in a stationary Markov reward process setting, and generalize the transformation (Ma and Yu, 2017) to MDPs with stochastic reward functions so that the return distribution is kept intact. Furthermore, we show that the return distribution can be estimated effectively when the distribution is approximately normal.
1.1 Literature
Risk concerns arise in an RL problem in two ways. One is the “external” uncertainty about the model parameters, a problem known as robust MDPs. In robust MDPs one optimizes the expected return under worst-case parameters drawn from a set of plausible MDP parameters, for example an MDP with uncertain transition matrices (Nilim and Ghaoui, 2005).
This paper concerns the “internal” risk, which arises from the stochastic nature of the process itself. Two classes of risk-sensitive objectives have been examined in recent years. One is the coherent risk measure (Artzner et al., 1998), which satisfies a set of intuitively reasonable properties (convexity, for example). Ruszczyński and Shapiro (2006) presented a thorough study of coherent risk optimization. The other important class is the mean-variance measure (White, 1988; Sobel, 1994; Mannor and Tsitsiklis, 2011), in which the expected cumulative reward is maximized at a given risk level (variance). It is also known as modern portfolio theory. Internal risk concerns arise not only mathematically but also psychologically. A classic example in psychology is the “St. Petersburg Paradox,” a lottery with an infinite expected reward for which people are nevertheless willing to pay only a small amount to play. This problem is thoroughly studied in utility theory, and a recent study brought this idea to reinforcement learning (Prashanth et al., 2016).
Value-at-Risk originates from finance. For a given portfolio (which can be considered as an MDP with a policy), a loss threshold (target level), and a time horizon, VaR concerns the probability that the loss on the portfolio exceeds the threshold over the time horizon. The two VaR problems defined in the next section are solved by estimating the VaR function, which is the infimum of the return distribution set. Since the VaR objective is not coherent, we choose it as an example.
Central limit theorems (CLTs) for Markov chains have been studied for decades. Most work in this field concerns the partial sum of rewards. Under different conditions, the distribution of the partial sum can be estimated (Jones, 2004; Meyn and Tweedie, 2009). Taking advantage of the variance calculation method presented by Sobel (1982), we estimate the return distribution for a Markov reward process assuming it is approximately normal. Since both the reward distribution estimation and the variance calculation method require the reward function to be deterministic and state-based, a transformation is needed for MDPs with other types of reward functions. The generalized transformation is studied thoroughly in Section 3.
1.2 Overview
For risk-sensitive objectives in infinite-horizon MDPs, we estimate the return distribution in a stationary scenario. We take VaR as an example to show the effect of the reward simplification on the distribution, and generalize a transformation that keeps the distribution intact in most circumstances.
In Section 2, we first define the MDP notation with four types of reward functions and two policy spaces, and pin down the reward simplification as the main problem. Second, two VaR objectives are introduced as examples to show the effect of the reward simplification. Third, an infinite-horizon MDP for an inventory control problem is described to show the error caused by the reward simplification.
In Section 3, we first assume that the return is normally distributed, which is a fair assumption for ergodic Markov reward processes with a discount factor close to 1. Second, we develop the transformation for MDPs with stochastic reward functions in three cases.
In short, when the objective is risk-sensitive, the whole return distribution should be preserved rather than only its expectation. When the reward function is not deterministic and state-based, or the policy is randomized, the generalized transformation should be carried out first. We believe that related studies of risk-sensitive problems in RL should be revisited with the reward simplification in mind.
2 Preliminaries and Notations
In this section, we first present the notation for MDPs with four types of reward functions and two policy spaces, which are used in the next section. Second, the VaR objectives are defined, as well as the VaR function, which depends on the set of return distributions over all policies. Third, an inventory control problem is described, which is a straightforward example of an MDP with a transition-based reward function.
2.1 Markov Decision Processes (MDPs)
In this paper we focus on infinite-horizon discrete-time MDPs, which can be represented by
in which is a finite state space, and denotes the state at (decision) epoch ; is the legitimate action set for , is a finite action space, and denotes the action at epoch ; is a bounded reward function, and denote the reward at epoch by ; denotes the homogeneous transition probability; ; is the initial state distribution; is the discount factor.
In this paper we study the distribution of the return in infinite-horizon MDPs. We consider four types of reward functions:

the deterministic state-based reward;

the deterministic transition-based reward;

the stochastic state-based reward; and

the stochastic transition-based reward (with a slight abuse of notation, the same reward notation is also used for a Markov reward process).
When the reward function is not deterministic and state-based, it is often naively simplified by taking an expectation. For example, a deterministic transition-based reward function can be simplified to a deterministic state-based one by
(1) 
where , and is the transition kernel. In practical problems, stochastic reward functions are often naively simplified to functions in a similar way.
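The naive simplification can be sketched in a few lines. The sketch assumes the usual reading of Equation 1, namely that the simplified reward is the expectation of the transition-based reward over the next state, r'(x, a) = Σ_y p(y | x, a) r(x, a, y). The nested-list layout `P[x][a][y]`, `R[x][a][y]` and the function name are illustrative choices, not notation from the paper.

```python
def simplify_reward(P, R):
    """Naive simplification: r'(x, a) = sum_y p(y | x, a) * r(x, a, y).

    P[x][a][y] is a transition probability and R[x][a][y] the reward
    received on that transition (this data layout is an assumption).
    """
    return [
        [sum(p * r for p, r in zip(P[x][a], R[x][a])) for a in range(len(P[x]))]
        for x in range(len(P))
    ]


# Two states, one action. From state 0 the reward is 0 or 2 with equal
# probability; from state 1 it is always 1.
P = [[[0.5, 0.5]], [[0.5, 0.5]]]
R = [[[0.0, 2.0]], [[1.0, 1.0]]]
simplified = simplify_reward(P, R)
```

In this toy example both states end up with the same simplified reward, so the spread of rewards out of state 0 is no longer visible, which is the kind of information loss discussed in this paper.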
In reinforcement learning, when the expected cumulative reward is the objective, the Q-function or the value function is used. When the reward function is not deterministic and state-based, it is often naively simplified to one that is. The effect of this reward simplification on the cumulative reward distribution is studied in (Ma and Yu, 2017). Here we estimate the distribution assuming it is approximately normal, illustrate the similar effect on the return distribution, and generalize the transformation for wider use.
A policy describes how to choose actions sequentially. For infinite-horizon MDPs, we focus on two stationary Markovian policy spaces: the deterministic policy space and the randomized policy space. A Markov reward process can be considered as an MDP with a fixed policy. Randomized policies are often considered in constrained MDPs (Altman, 1999). Given an MDP with a randomized policy, the reward function is often naively simplified as well. Both naive reward simplifications change the return distribution. Since most, if not all, risk-sensitive objectives are functionals of the return distribution, we generalize the transformation to the settings mentioned above in order to keep the return distribution intact.
2.2 Value-at-Risk (VaR)
Two VaR problems described in (Filar et al., 1995) are considered as optional objectives. Given a policy and an initial distribution , define the return by , and here we simplify it to . Denote the return distribution with the policy by , the specified policy space by . VaR addresses the following problems.
Definition 2.1.
Given a quantile , find the optimal threshold .
This problem refers to the quantile function, i.e., .
Definition 2.2.
Given a threshold , find the optimal quantile .
This problem concerns .
When the estimated return distribution is strictly increasing, any point along the function is (estimated) with or . Therefore, both VaR objectives refer to the infimum function, which we call the VaR function. Since the VaR function depends on the return distribution set, we consider the VaR objective as a risk-sensitive example to show the effect of the reward simplification. See (Ma and Yu, 2017) for more details about the VaR function.
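As a rough illustration of the VaR function as a pointwise infimum, the sketch below assumes a finite set of candidate policies whose return distributions have each been approximated by a normal distribution, and reads "optimal quantile" as the smallest probability of the return falling at or below the threshold. The `(mean, std)` pairs and the function names are hypothetical.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a normal distribution, written with the standard error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def var_function(tau, policies):
    """Pointwise infimum over policies of P(return <= tau).

    `policies` is a list of (mean, std) pairs, one per candidate policy,
    each standing in for that policy's estimated return distribution.
    """
    return min(normal_cdf(tau, mean, std) for mean, std in policies)


# Two hypothetical policies with the same mean but different spread.
policies = [(10.0, 1.0), (10.0, 3.0)]
low = var_function(8.0, policies)    # attained by the low-variance policy
high = var_function(12.0, policies)  # attained by the high-variance policy
```

The policy attaining the infimum changes with the threshold even though both policies have the same expected return, which is why an expectation-only view is not enough for this objective.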
2.3 Inventory Problem MDP Description
Section 3.2.1 in (Puterman, 1994) describes the model formulation and some assumptions for a single-product stochastic inventory control problem. Define the warehouse capacity , and the state space . Briefly, at time epoch , denote the inventory level by before the order, the order quantity by , the demand by with a time-homogeneous probability distribution , where , then we have .
For , denote the cost to order units by , a fixed cost for placing orders, then we have the order cost . Denote the revenue when units of demand is fulfilled by , the maintenance fee by . The real reward function is .
We set the parameters as follows. The fixed order cost , the variable order cost , the maintenance fee , the warehouse capacity , and the price . The probabilities of demands are , , respectively. The initial distribution . In this infinite-horizon MDP, the reward function is deterministic and transition-based. The simplified reward function, which is state-based, can be calculated by Equation 1.
As illustrated in Figure 1, now we have two MDPs with different reward functions: and .
3 Normal Distribution Estimation
In this section, we estimate the return distribution assuming it is approximately normal, and generalize the transformation for three cases. First, we state the assumption that the return is normally distributed and review the variance calculation method. Second, we estimate the distributions for the process with the naive reward simplification and for its transformed counterpart, compare them with the empirical distribution, and illustrate the error caused by the simplification as well as the validity of the normal distribution assumption for this MDP. Third, we generalize the transformation for three cases.
3.1 Normal Distribution Assumption for Return
Instead of the expected (discounted) cumulative reward, we consider risk from the distributional perspective. Functional distribution estimation in Markov reward processes has been studied for decades (Woodroofe, 1992; Meyn and Tweedie, 2009). However, there is no related CLT for the discounted sum of rewards. In this section, we assume that the return is normally distributed in order to reduce the information needed for distribution estimation.
Assumption 1.
For a Harris ergodic Markov reward process , the return is normally distributed.
This assumption is intuitively plausible, since the return can be split into a partial sum and a remainder. Because of the discount factor, the remainder goes to zero exponentially; as the discount factor goes to 1, the partial sum is approximately normally distributed (Jones, 2004). Furthermore, the assumption only aims at simplifying the distribution information. How accurately the estimate represents the distribution depends on how much of it is captured by the first and second moments.
For an infinite-horizon Markov reward process with a deterministic state-based reward function, Sobel (1982) presented the formula for the variance of the return.
Theorem 3.1.
(Sobel, 1982) Given an infinite-horizon Markov reward process with the finite state space , the reward function deterministic, state-based and bounded, and the discount factor . Denote the transition matrix by , in which . Denote the conditional return expectation by for any deterministic initial state , and the conditional expectation vector (value function) by . Similarly, denote the conditional return variance by , and the conditional variance vector by . Let denote the vector whose th component is . Then
With the aid of Theorem 3.1, we can now estimate the return distribution for an ergodic Markov reward process under Assumption 1. Notice, however, that the variance calculation method applies only to Markov reward processes with a deterministic state-based reward function. In the next subsection we estimate the return distribution with the aid of the generalized transformation, and compare it with the one obtained from the reward simplification.
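The mean and variance needed under Assumption 1 can be computed as in the following sketch. It relies on the standard recursions for a state-based reward: the mean satisfies v = r + γPv, and, by conditioning on the next state, the variance satisfies ψ = θ + γ²Pψ with θ_x = Σ_y p(y|x)(r(x) + γv_y)² − v_x². The symbols and the names `P`, `r`, `gamma`, `sweeps` are assumptions made for this illustration, and plain fixed-point iteration is used here in place of a direct linear solve to keep the code dependency-free.

```python
def return_mean_and_variance(P, r, gamma, sweeps=500):
    """Mean v and variance psi of the discounted return, per initial state.

    P[x][y] is the transition matrix, r[x] the deterministic state-based
    reward, gamma the discount factor in [0, 1).
    """
    n = len(r)

    def apply(vec):
        # Matrix-vector product P @ vec.
        return [sum(P[x][y] * vec[y] for y in range(n)) for x in range(n)]

    # Mean: iterate v <- r + gamma * P v.
    v = [0.0] * n
    for _ in range(sweeps):
        Pv = apply(v)
        v = [r[x] + gamma * Pv[x] for x in range(n)]

    # theta[x] = E[(r(x) + gamma * v(next))^2 | x] - v(x)^2.
    theta = [
        sum(P[x][y] * (r[x] + gamma * v[y]) ** 2 for y in range(n)) - v[x] ** 2
        for x in range(n)
    ]

    # Variance: iterate psi <- theta + gamma^2 * P psi.
    psi = [0.0] * n
    for _ in range(sweeps):
        Ppsi = apply(psi)
        psi = [theta[x] + gamma ** 2 * Ppsi[x] for x in range(n)]
    return v, psi


# Two states visited independently and uniformly at random, rewards 0 and 1.
v, psi = return_mean_and_variance([[0.5, 0.5], [0.5, 0.5]], [0.0, 1.0], 0.5)
```

For this toy chain the future rewards are independent with variance 1/4 each, so the variance from either state is (1/4)·γ²/(1 − γ²) = 1/12 at γ = 0.5, which the iteration reproduces. With a discount factor close to 1 the iteration converges slowly, and solving the two linear systems directly is the more practical choice.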
3.2 State-Transition Transformation
Here we generalize the state-transition transformation (Ma and Yu, 2017) for three cases.

Case 1: a Markov reward process with a stochastic, transition-based reward function (with a slight abuse of notation, we also call the reward function in a Markov reward process state-based or transition-based, since one or two states are involved, respectively);

Case 2: an MDP with a stochastic transition-based reward function, and a randomized policy; and

Case 3: an MDP with a randomized policy space.
Case 1 can be considered as an MDP with a (or ) and a deterministic policy. Case 2 usually arises in constrained MDPs. Case 3 may help direct policy search (a gradient descent method, for example) from a distributional perspective. In all three cases, the reward functions are often naively simplified in a way similar to Equation 1, which loses all moment information except the first moment (the expectation). Notice that the problem in (Ma and Yu, 2017) (Case 0: a Markov reward process with a deterministic, transition-based reward function) is a special case of Case 1, and denote this relationship by . In general, all four cases have the relationship
and we develop the transformation for all cases in order.
3.2.1 Transformation Generalization for Case 1
In the Markov reward process setting, when the reward function is transition-based, it is often naively simplified by Equation 1, which changes all moments of the distribution except the first. In other words, the naive simplification keeps only the expectation intact. To estimate the return distribution for a Markov reward process with a deterministic transition-based reward function, the state-transition transformation (Algorithm 1) proposed in (Ma and Yu, 2017) should be applied first. Algorithm 1 shows that the transformation also works in a finite-horizon setting with a salvage reward.
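One way to realize such a transformation for a deterministic transition-based reward is sketched below. Algorithm 1 itself is not reproduced in this text, so this is a construction assumed for illustration rather than the authors' exact procedure: each feasible transition (x, y) becomes a state of the new process, the reward of that state is r(x, y), and from (x, y) the process moves to (y, z) with probability p(z | y). All names are hypothetical.

```python
def transform_transition_reward(P, R, mu):
    """Build a Markov reward process whose states are transitions (x, y).

    P[x][y]: transition matrix, R[x][y]: transition-based reward,
    mu[x]: initial state distribution of the original process.
    Returns (states, kernel, reward, init) for the transformed process,
    whose reward depends on the current (pair) state only.
    """
    n = len(P)
    # Keep only transitions that can actually occur.
    states = [(x, y) for x in range(n) for y in range(n) if P[x][y] > 0]
    reward = [R[x][y] for x, y in states]
    # From pair (x, y) the chain moves to pair (y, z) with probability p(z | y).
    kernel = [
        [P[y][z] if y_next == y else 0.0 for (y_next, z) in states]
        for (_, y) in states
    ]
    # The first pair (x0, x1) has probability mu(x0) * p(x1 | x0).
    init = [mu[x] * P[x][y] for x, y in states]
    return states, kernel, reward, init


P = [[0.5, 0.5], [1.0, 0.0]]
R = [[0.0, 2.0], [1.0, 0.0]]
states, kernel, reward, init = transform_transition_reward(P, R, [1.0, 0.0])
```

The transformed process emits at epoch t exactly the reward r(x_t, x_{t+1}) of the original one, so the two reward sequences have the same law, while the new reward is deterministic and state-based and can therefore be fed to a variance calculation such as the one in Theorem 3.1.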
With the aid of Theorem 3.1 and Assumption 1, the return distribution for a Markov reward process with a deterministic transition-based reward function can be estimated after first applying the transformation algorithm. For the inventory MDP with policy , Figure 2 compares the return distributions for the two MDPs (one from the naive reward simplification, and the other with the transformation) with the averaged empirical return distribution, whose error bars represent the standard deviation of the mean. The simulation is repeated 50 times with a time horizon 1000, and .
The Kolmogorov–Smirnov statistic (Durbin, 1973) is used to quantify the distribution difference (error). Denote the averaged empirical return distribution by , the estimated distribution for the transformed process by , and the estimated distribution for the process with the naive reward simplification by . For the case in Figure 2, , and . The results show that the naive reward simplification leads to a nontrivial estimation error, and that the normal distribution assumption is valid for this example.
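For reference, a minimal sketch of a one-sample Kolmogorov–Smirnov statistic, the largest gap between an empirical distribution function and a candidate continuous CDF, is given below. The function names are illustrative, and the normal CDF stands in for an estimated return distribution.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a normal distribution, written with the standard error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """sup_x |F_n(x) - F(x)| for an empirical sample against a continuous CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# A single sample at the mean of a standard normal gives a gap of one half.
d_example = ks_statistic([0.0], lambda x: normal_cdf(x, 0.0, 1.0))
```

With simulated returns in `samples` and the two estimated normal distributions as `cdf`, the two resulting values can be compared in the same way as the statistics reported for Figure 2.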
Now consider the VaR objective. The VaR function is obtained by enumerating all the (deterministic) policies. Figure 3 shows the two estimated VaR functions. Since the VaR function can also be regarded as a return distribution, we can still use to measure the error from the reward simplification, and in this case . Denote the optimal quantile for the MDP with the naive reward simplification by , then the error bound for the optimal quantile , which is nontrivial in a risksensitive problem.
Figure 2 shows that the distribution for the process with the reward simplification has a smaller variance. The reason can be explained intuitively by the analysis of variance (Scheffé, 1999). Take the deterministic transition-based reward function as an example. Treating the possible rewards for the same current state as a group, the variance of the reward includes the variance between groups and the variance within groups. When the reward function is naively simplified by Equation 1, the within-group variance is removed, so the variance is smaller. The same happens for the other types of reward simplification.
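For a single one-step reward, this grouping argument is the law of total variance. Writing R for the reward received at an epoch and X for the current state (symbols assumed here for illustration):

```latex
\operatorname{Var}(R)
  = \underbrace{\mathbb{E}\big[\operatorname{Var}(R \mid X)\big]}_{\text{within groups}}
  + \underbrace{\operatorname{Var}\big(\mathbb{E}[R \mid X]\big)}_{\text{between groups}}
```

The naive simplification replaces R by the conditional expectation E[R | X], whose variance is the between-group term alone, so the nonnegative within-group term is dropped. This identity covers one reward at a time; the variance of the whole return also involves dependence across epochs.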
Remark 1 (Transformation for the process with a stochastic reward function).
To convert a Markov reward process with a stochastic transition-based reward function (or one of the other types above) to one with a deterministic state-based reward function, while keeping the return distribution intact, a bijective mapping between the state space and the space of possible “situations” is needed. Here the reward for a given transition pair is treated as a discrete random variable only; when it is continuous, the transformed state space is continuous as well. For a Markov reward process with , a possible situation can be defined by a tuple , in which .
3.2.2 Transformation Generalization for Case 2
Given an MDP with a randomized policy , the reward function is often naively simplified as well. Taking a deterministic state-based reward function as an example, the reward function is simplified to . There are two ways to handle the randomized policy so as to keep the distribution intact. One is to treat it as a randomization of the reward, which means that will be converted to a stochastic function whose value equals with the probability . The other is to include the action in the situation described in Remark 1. Both ways result in the same transformed Markov reward process.
Theorem 3.2 (Transformation concerning policy).
For a Markov decision process with stochastic and transition-based, and given a randomized policy , there exists a Markov reward process with deterministic and state-based, such that both processes have the same return distribution.
Here we generalize the transformation for Case 2 in Algorithm 2.
3.2.3 Transformation Generalization for Case 3
The two settings dealt with above are for policy evaluation. In this subsection we prove the transformation for MDPs, which enables policy search techniques (stochastic gradient descent, for example) in a risk-sensitive scenario.
Theorem 3.3 (Transformation for MDPs).
Given an MDP with stochastic and transition-based, there exists an MDP with deterministic and state-based, such that for any given policy (possibly randomized) for , there exists a corresponding policy for , such that both Markov reward processes have the same total reward distribution.
Proof.
The proof has two steps. Step 1 constructs a second MDP and shows that, for every possible sample path of the first MDP, there exists a corresponding sample path of the second MDP. Step 2 proves that the probability of any possible sample path in the first MDP equals the probability of its counterpart in the second MDP.
Step 1:
Define . For , define . Define . In order to remove the dependency of the initial state distribution on policy, define a null state space , and . Define the state space .
For all , define the state-based reward function , and ; define the transition kernel , and ; the initial state distribution .
Now we have two MDPs. Let , and . For any sample path in , there exists a sample path
in . Therefore, we proved that for every possible sample path for the first MDP, there exists a corresponding sample path for the second MDP.
Step 2:
Next we prove the probabilities for the two sample paths are equal. Here we prove it by mathematical induction. Set time epoch to after the first in , and after the first in .
Denote the partial sample path till epoch by in . Given any for , the probability for the sample path before epoch in is
There exists a policy for , with , and . The probability for the sample path before epoch in is
Therefore, at epoch , the two partial sample paths share the same probability.
Assuming that the two partial sample paths share the same probability at epoch , then the probability for the sample path before epoch in is
The probability for the sample path before epoch in is
By induction, we have proved that the probability of any possible sample path in the first MDP equals the probability of its counterpart in the second MDP. As a subsequence of a sample path, each possible reward sequence for the two MDPs has the same probability as well. Therefore, the total reward distributions for the two MDPs are the same. Theorem 3.3 is proved.
∎
Notice that Theorem 3.3 focuses on the total reward distribution instead of the return distribution. When the return distribution is considered, the reward function should be multiplied by to compensate for the time drift introduced by the null states.
4 Conclusion and Discussion
In this paper, we illustrated the effect of the naive reward simplification on the return distribution and generalized the transformation for MDPs in different settings. By applying the transformation instead of simplifying the reward function in the naive way, MDPs with different types of reward functions and (or) randomized policies are converted to ones with deterministic state-based reward functions and an intact return (total reward) distribution. The generalized transformation provides a platform for using the traditional value function and Q-function in risk-sensitive RL.
The transformation algorithm is suitable for stationary settings, where the MDP parameters do not vary with time. In a dynamic setting, algorithms (Dearden et al., 1998; Bellemare et al., 2017) based on the distributional Bellman equation (Morimura et al., 2010) can estimate the return distribution.
References
 Altman (1999) Altman, E. (1999). Constrained Markov Decision Processes. CRC Press.
 Artzner et al. (1998) Artzner, P., Delbaen, F., Eber, J., and Heath, D. (1998). Coherent measures of risk. Mathematical Finance, 9(3):1–24.
 Bellemare et al. (2017) Bellemare, M., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 449–458.
 Dearden et al. (1998) Dearden, R., Friedman, N., and Russell, S. (1998). Bayesian Q-learning. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI), pages 761–768.
 Durbin (1973) Durbin, J. (1973). Distribution Theory for Tests based on the Sample Distribution Function. SIAM.
 Filar et al. (1995) Filar, J. A., Krass, D., and Ross, K. W. (1995). Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10.
 Jones (2004) Jones, G. L. (2004). On the Markov chain central limit theorem. Probability Surveys, 1:299–320.
 Ma and Yu (2017) Ma, S. and Yu, J. Y. (2017). Transition-based versus state-based reward functions for MDPs with Value-at-Risk. In Proceedings of the 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 974–981.
 Mannor and Tsitsiklis (2011) Mannor, S. and Tsitsiklis, J. (2011). Mean-variance optimization in Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1–22.
 Meyn and Tweedie (2009) Meyn, S. P. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability. Springer Science & Business Media.
 Morimura et al. (2010) Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. (2010). Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 799–806.
 Nilim and Ghaoui (2005) Nilim, A. and Ghaoui, L. E. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
 Prashanth et al. (2016) Prashanth, L. A., Jie, C., Fu, M., Marcus, S., and Szepesvári, C. (2016). Cumulative prospect theory meets reinforcement learning: Prediction and control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1406–1415.
 Puterman (1994) Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
 Ruszczyński and Shapiro (2006) Ruszczyński, A. and Shapiro, A. (2006). Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433–452.
 Scheffé (1999) Scheffé, H. (1999). The Analysis of Variance. John Wiley & Sons.
 Sobel (1982) Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802.
 Sobel (1994) Sobel, M. J. (1994). Mean-variance tradeoffs in an undiscounted MDP. Operations Research, 42(1):175–183.
 White (1988) White, D. J. (1988). Mean, variance, and probabilistic criteria in finite Markov decision processes: A review. Journal of Optimization Theory and Applications, 56(1):1–29.
 Woodroofe (1992) Woodroofe, M. (1992). A central limit theorem for functions of a Markov chain with applications to shifts. Stochastic processes and their applications, 41(1):33–44.