Distribution Estimation in Discounted MDPs via a Transformation

Shuai Ma and Jia Yuan Yu
Concordia Institute of Information System Engineering, Concordia University, Montréal, Quebec H3G 1M8, Canada (e-mail: m_shua@encs.concordia.ca and jiayuan.yu@concordia.ca).

Abstract

Although the general deterministic reward function in MDPs takes three arguments (current state, action, and next state), it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective is a function of the expected cumulative reward only, this simplification works perfectly. However, when the objective is risk-sensitive, e.g., when it depends on the reward distribution, this simplification leads to incorrect values of the objective. This paper studies the distribution estimation of the cumulative discounted reward in infinite-horizon MDPs with finite state and action spaces. First, taking the Value-at-Risk (VaR) objective as an example, we illustrate and analyze the error that the above simplification introduces in the reward distribution. Next, we propose a transformation for MDPs that preserves the reward distribution and converts transition-based reward functions to deterministic state-based reward functions. This transformation works whether the transition-based reward function is deterministic or stochastic. Lastly, we show how to estimate the reward distribution after applying the proposed transformation in different settings, provided that the distribution is approximately normal.

1 Introduction

In general reinforcement learning (RL) settings, two important functions are derived from the reward function: the value function on the state space, and the Q-function on the state-action space. Both functions play important roles in RL since they represent expected overall values. However, the reward function often takes a more complicated form: it can be transition-based (defined on the state-action-state space), stochastic, or both, and the usual reward simplification leads to a different reward distribution. This is because the simplification keeps only the first-moment information of the reward distribution. This paper aims to solve risk-sensitive problems with techniques that require the reward to be deterministic and to depend only on the current state (and action), while keeping the distribution intact.

We focus on estimating the distribution of the return (in this paper the word refers to, but is not limited to, the discounted total reward) in an infinite-horizon MDP with finite state and action spaces, and consider the Value-at-Risk (VaR) objective as a risk-sensitive example. We illustrate the loss of higher-moment information that the reward simplification causes in the return distribution, in a stationary Markov reward process setting, and generalize the transformation (Ma and Yu, 2017) to MDPs with stochastic reward functions so that the return distribution stays intact. Furthermore, we show that the return distribution can be estimated effectively when the distribution is approximately normal.

1.1 Literature

Risk concerns arise in RL problems in two respects. One is the “external” uncertainty about the model parameters, which is studied under the name of robust MDPs. In robust MDPs one optimizes the expected return under worst-case parameters, which belong to a set of plausible MDP parameters. An example is an MDP with uncertain transition matrices (Nilim and Ghaoui, 2005).

This paper concerns the “internal” risk, which arises from the stochastic nature of the process itself. Two classes of risk-sensitive objectives have been examined in recent years. One is the coherent risk measure (Artzner et al., 1998), which satisfies a set of intuitively reasonable properties (convexity, for example). Ruszczyński and Shapiro (2006) presented a thorough study of coherent risk optimization. The other important class is the mean-variance measure (White, 1988; Sobel, 1994; Mannor and Tsitsiklis, 2011), in which the expected cumulative reward is maximized at a given risk level (variance). It is also known as modern portfolio theory. Internal risk concerns arise not only mathematically but also psychologically. A classic example in psychology is the “St. Petersburg Paradox,” which refers to a lottery with an infinite expected reward for which people are nevertheless willing to pay only a small amount to play. This problem is thoroughly studied in utility theory, and a recent study brought this idea to reinforcement learning (Prashanth et al., 2016).

Value-at-Risk originates from finance. For a given portfolio (which can be considered as an MDP with a policy), a loss threshold (target level), and a time-horizon, VaR concerns the probability that the loss on the portfolio exceeds the threshold over the time horizon. Two VaR problems defined in the next section are solved by estimating the VaR function, which is the infimum of the return distribution set. Since the VaR objective is not coherent, we choose it as an example.

The central limit theorem (CLT) for Markov chains has been studied for decades. Most works in this field concern the partial sum of rewards. Under different conditions, the distribution of the partial sum can be estimated (Jones, 2004; Meyn and Tweedie, 2009). Taking advantage of the variance calculation method presented by Sobel (1982), we estimate the return distribution for a Markov reward process assuming it is approximately normal. Both the reward distribution estimation and the variance calculation method require the reward function to be deterministic and state-based, so the transformation is needed for MDPs with other types of reward functions. The generalized transformation is thoroughly studied in Section 3.

1.2 Overview

For risk-sensitive objectives in infinite-horizon MDPs, we estimate the return distribution in a stationary scenario. We take VaR as an example to show the effect of the reward simplification on the distribution, and generalize a transformation that keeps the distribution intact in most circumstances.

In Section 2, firstly, we define the MDP notations with four types of reward functions and two policy spaces, and pin down the reward simplification as the main problem. Secondly, two VaR objectives are introduced as examples to show the effect of the reward simplification. Thirdly, an infinite-horizon MDP for an inventory control problem is described to show the error from the reward simplification.

In Section 3, firstly, we assume the return is normally distributed, which is a fair assumption for ergodic Markov reward processes with a discount factor close to 1. Secondly, we extend the transformation to MDPs with stochastic reward functions in three cases.

In short, when the objective is risk-sensitive, the return distribution should be preserved, not only its expectation. When the reward function is not deterministic and state-based, or the policy is randomized, the generalized transformation should be carried out first. We believe that related studies of risk-sensitive problems in RL should be revisited with the reward simplification in mind.

2 Preliminaries and Notations

In this section, firstly we present the notations for MDPs with four types of reward functions and two policy spaces, which are used in the next section. Secondly, the VaR objectives are defined, as well as the VaR function, which depends on the set of return distributions over all policies. Thirdly, an inventory control problem is described, which is a straightforward example of an MDP with a transition-based reward function.

2.1 Markov Decision Processes (MDPs)

In this paper we focus on infinite-horizon discrete-time MDPs, which can be represented by

 ⟨S,A,r,p,μ,γ⟩,

in which $S$ is a finite state space; $A$ is a finite action space, of which each state has its own legitimate action set; $r$ is a bounded reward function; $p$ denotes the homogeneous transition probability; $\mu$ is the initial state distribution; and $\gamma$ is the discount factor. The state, the action, and the reward are each indexed by the (decision) epoch at which they occur.

In this paper we study the distribution of the return in infinite-horizon MDPs. For , here we consider four types of reward functions.

1. The deterministic state-based reward $r_{DS}(x,a)$;

2. the deterministic transition-based reward $r_{DT}(x,a,y)$;

3. the stochastic state-based reward; and

4. the stochastic transition-based reward. (Footnote 3: With a slight abuse of notation, the same reward notation is also used for a Markov reward process.)

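When the reward function is not deterministic and state-based, it is often naively simplified by taking its expectation. For example, a deterministic transition-based reward function $r_{DT}$ can be simplified to a deterministic state-based reward function $r_{DS}$ by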

 $$r_{DS}(x,a)=\sum_{y\in S} p(y\mid x,a)\, r_{DT}(x,a,y), \tag{1}$$

where $p$ is the transition kernel. In practical problems, stochastic reward functions are often naively simplified to deterministic state-based functions in a similar way.
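As a minimal sketch of Equation 1, the following Python snippet collapses a deterministic transition-based reward into a state-based one by averaging over next states. The dictionary layout and the names `simplify_reward`, `p`, and `r_dt`, as well as the two-state example, are illustrative assumptions and do not come from the paper.

```python
def simplify_reward(p, r_dt):
    """Naive simplification (Equation 1): average r_DT over next states.

    p[(x, a)]    -> dict mapping next state y to p(y | x, a)
    r_dt[(x, a)] -> dict mapping next state y to r_DT(x, a, y)
    Both layouts are assumptions made for this sketch.
    """
    r_ds = {}
    for (x, a), next_probs in p.items():
        r_ds[(x, a)] = sum(prob * r_dt[(x, a)][y] for y, prob in next_probs.items())
    return r_ds


# Hypothetical two-state example with a single action "stay".
p = {(0, "stay"): {0: 0.5, 1: 0.5}, (1, "stay"): {1: 1.0}}
r_dt = {(0, "stay"): {0: 0.0, 1: 10.0}, (1, "stay"): {1: 2.0}}
r_ds = simplify_reward(p, r_dt)
```

In state 0 the original reward is either 0 or 10, whereas the simplified reward is a constant: the mean is preserved but the spread is not.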

In reinforcement learning, the expected cumulative reward is usually the quantity of interest, and it is accessed through the Q-function or the value function. When the reward function is not deterministic and state-based, it is often simplified to one in a naive way. The effect of the reward simplification on the cumulative reward distribution is studied in (Ma and Yu, 2017). Here we estimate the return distribution assuming it is approximately normal, illustrate the similar effect on the return distribution, and generalize the transformation for wider use.

A policy describes how to choose actions sequentially. For infinite-horizon MDPs, we focus on two stationary and Markovian policy spaces: the deterministic policy space and the randomized policy space. A Markov reward process can be considered as an MDP with a fixed policy. Randomized policies are often considered in constrained MDPs (Altman, 1999). Given an MDP with a randomized policy, the reward function is often naively simplified as well. Both naive reward simplifications change the return distribution. Since most, if not all, risk-sensitive objectives are functionals of the return distribution, we generalize the transformation to the settings mentioned above in order to keep the return distribution intact.

2.2 Value-at-Risk (VaR)

Two VaR problems described in (Filar et al., 1995) are considered as optional objectives. Given a policy and an initial distribution , define the return by , and here we simplify it to . Denote the return distribution with the policy by , the specified policy space by . VaR addresses the following problems.

Definition 2.1.

Given a quantile , find the optimal threshold .

This problem refers to the quantile function, i.e., .

Definition 2.2.

Given a threshold , find the optimal quantile .

This problem concerns .

When the estimated return distribution is strictly increasing, any point along the function is (estimated) with or . Therefore, both VaR objectives refer to the infimum function, which we call the VaR function. Since the VaR function depends on the return distribution set, we take the VaR objective as a risk-sensitive example to show the effect of the reward simplification. See (Ma and Yu, 2017) for more details about the VaR function.
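To make the infimum function concrete, here is a small sketch under the normal approximation used later in the paper. It reads the VaR function as the pointwise infimum, over a finite set of policies, of the estimated return distribution functions. The function names and the (mean, standard deviation) pairs are hypothetical.

```python
import math


def normal_cdf(t, mean, std):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((t - mean) / (std * math.sqrt(2.0))))


def var_function(t, policy_moments):
    """Pointwise infimum over policies of the estimated P(return <= t).

    policy_moments: list of (mean, std) pairs, one per policy (hypothetical input).
    """
    return min(normal_cdf(t, m, s) for m, s in policy_moments)


# Hypothetical (mean, std) of the return under two policies.
moments = [(10.0, 1.0), (11.0, 4.0)]
low = var_function(8.0, moments)    # the first pair attains the infimum here
high = var_function(12.0, moments)  # the second pair attains the infimum here
```

Which policy attains the infimum can change with the threshold, which is why the whole set of return distributions matters and not a single one.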

2.3 Inventory Problem MDP Description

Section 3.2.1 in (Puterman, 1994) described the model formulation and some assumptions for a single-product stochastic inventory control problem. Define the warehouse capacity , and the state space . Briefly, at time epoch , denote the inventory level by before the order, the order quantity by , the demand by with a time-homogeneous probability distribution , where , then we have .

For , denote the cost to order units by , a fixed cost for placing orders, then we have the order cost . Denote the revenue when units of demand is fulfilled by , the maintenance fee by . The real reward function is .

We set the parameters as follows. The fixed order cost , the variable order cost , the maintenance fee , the warehouse capacity , and the price . The probabilities of demands are , , respectively. The initial distribution . In this infinite-horizon MDP, the reward function is deterministic and transition-based. The simplified reward function can be calculated by Equation 1, which is state-based.
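The following sketch builds a transition kernel and a transition-based reward for a small inventory model of this kind. It assumes that unmet demand is lost, that the reward is revenue minus order cost minus maintenance fee, and that all costs are linear; the parameter values are hypothetical and are not the ones used in the paper.

```python
def build_inventory_mdp(capacity, demand_probs, fixed_cost, unit_cost, holding_fee, price):
    """Build p(y | x, a) and r_DT(x, a, y) for a single-product inventory model.

    demand_probs[d] is the probability that the demand equals d.
    The reward form (revenue - order cost - maintenance fee) and all
    parameter values used below are assumptions made for illustration.
    """
    p, r_dt = {}, {}
    for x in range(capacity + 1):            # inventory level before ordering
        for a in range(capacity - x + 1):    # order quantity keeps stock within capacity
            stock = x + a
            order_cost = fixed_cost + unit_cost * a if a > 0 else 0.0
            probs, rewards = {}, {}
            for d, prob in enumerate(demand_probs):
                y = max(stock - d, 0)        # unmet demand is lost
                probs[y] = probs.get(y, 0.0) + prob
                sold = stock - y
                rewards[y] = price * sold - order_cost - holding_fee * stock
            p[(x, a)], r_dt[(x, a)] = probs, rewards
    return p, r_dt


# Hypothetical parameters, not the ones used in the paper.
p, r_dt = build_inventory_mdp(
    capacity=2, demand_probs=[0.25, 0.5, 0.25],
    fixed_cost=4.0, unit_cost=2.0, holding_fee=1.0, price=8.0,
)
```

For a fixed state and action, the reward takes a different value for each next state, which is exactly the information that the naive simplification averages away.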

As illustrated in Figure 1, now we have two MDPs with different reward functions: and .

3 Normal Distribution Estimation

In this section, we estimate the return distribution assuming it is approximately normal, and generalize the transformation for three cases. Firstly, we propose the assumption that the return is normally distributed, and review the variance calculation method. Secondly, we estimate the distributions for the process with the naive reward simplification and for its transformed counterpart, compare them with the empirical distribution, and illustrate the error from the simplification, as well as the validity of the normal distribution assumption for this MDP. Thirdly, we generalize the transformation for three cases.

3.1 Normal Distribution Assumption for Return

Instead of the expected (discounted) cumulative reward, we consider risk from the distributional perspective. The functional distribution estimation in Markov reward processes has been studied for decades (Woodroofe, 1992; Meyn and Tweedie, 2009). However, there is no related CLT for the discounted sum of rewards. In this section, we assume that the return is normally distributed in order to reduce the information needed for distribution estimation.

Assumption 1.

For a Harris ergodic Markov reward process , the return is normally distributed.

This assumption holds intuitively, since the return can be split into a partial sum and a remainder. Because of the discount factor, the remainder goes to zero exponentially; as the discount factor goes to 1, the partial sum is approximately normally distributed (Jones, 2004). Furthermore, this assumption only aims at simplifying the distribution information. How accurately the estimate represents the distribution depends on how much of it is captured by the first and second moments.

For an infinite-horizon Markov reward process with a deterministic state-based reward function, Sobel (1982) presented the formula for the variance of the return.

Theorem 3.1.

(Sobel, 1982) Given an infinite-horizon Markov reward process with the finite state space , the reward function deterministic state-based and bounded, and the discount factor . Denote the transition matrix by , in which . Denote the conditional return expectation by for any deterministic initial state , and the conditional expectation vector (value function) by . Similarly, denote the conditional return variance by , and the conditional variance vector by . Let denote the vector whose th component is . Then

 $$v = r' + \gamma P v = (I-\gamma P)^{-1} r',$$
 $$\psi = \theta + \gamma^{2} P \psi = (I-\gamma^{2} P)^{-1}\theta.$$

Now, with the aid of Theorem 3.1, we can estimate the return distribution for an ergodic Markov reward process under Assumption 1. Notice, however, that the variance calculation method applies only to Markov reward processes with a deterministic state-based reward function. In the next subsection we estimate the return distribution with the aid of the generalized transformation, and compare it with the one obtained from the reward simplification.
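The two linear systems of Theorem 3.1 can be sketched in plain Python as follows. Instead of inverting matrices, this sketch uses fixed-point iteration, which converges because the discount factor is below 1. The expression used for the components of the vector theta is derived here from the law of total variance for a state-based reward and should be read as an assumption of this sketch; the function names and the example chain are hypothetical.

```python
def return_moments(P, r, gamma, iters=2000):
    """Conditional mean v and variance psi of the discounted return.

    P: row-stochastic transition matrix (list of lists), r: state-based rewards.
    Solves v = r + gamma P v and psi = theta + gamma^2 P psi by fixed-point
    iteration. The expression used for theta follows from the law of total
    variance and is an assumption of this sketch.
    """
    n = len(r)

    def apply(vec):
        return [sum(P[x][y] * vec[y] for y in range(n)) for x in range(n)]

    v = [0.0] * n
    for _ in range(iters):
        Pv = apply(v)
        v = [r[x] + gamma * Pv[x] for x in range(n)]

    Pv = apply(v)
    Pv2 = apply([val * val for val in v])
    theta = [gamma ** 2 * (Pv2[x] - Pv[x] ** 2) for x in range(n)]

    psi = [0.0] * n
    for _ in range(iters):
        Ppsi = apply(psi)
        psi = [theta[x] + gamma ** 2 * Ppsi[x] for x in range(n)]
    return v, psi


def normal_parameters(mu, v, psi):
    """Mean and variance of the return under the initial distribution mu."""
    mean = sum(m * vx for m, vx in zip(mu, v))
    second = sum(m * (px + vx * vx) for m, vx, px in zip(mu, v, psi))
    return mean, second - mean * mean


# Hypothetical chain: the next state is a fair coin flip, reward equals the state.
v, psi = return_moments([[0.5, 0.5], [0.5, 0.5]], [0.0, 1.0], gamma=0.5)
mean, variance = normal_parameters([0.5, 0.5], v, psi)
```

The resulting mean and variance are the two parameters needed for the normal estimate under Assumption 1. For discount factors close to 1 the iteration count would need to grow, and a direct linear solve may be preferable.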

3.2 State-Transition Transformation

Here we generalize the state-transition transformation (Ma and Yu, 2017) for three cases.

• Case 1: a Markov reward process with a stochastic, transition-based reward function (footnote 4: with a slight abuse of notation, we also call the reward function in a Markov reward process state-based (transition-based) since one (two) state(s) is (are) involved);

• Case 2: an MDP with a stochastic transition-based reward function, and a randomized policy; and

• Case 3: an MDP with a randomized policy space.

Case 1 can be considered as an MDP with a reward function of the corresponding type and a deterministic policy. Case 2 usually arises in constrained MDPs. Case 3 may help direct policy search (a gradient descent method, for example) from a distributional perspective. In all three cases, the reward functions are often naively simplified in a way similar to Equation 1, which loses all moment information except for the first moment (the expectation). Note that the problem in (Ma and Yu, 2017) (Case 0: a Markov reward process with a deterministic, transition-based reward function) is a special case of Case 1, and we denote this relationship by $\text{Case 0} \subseteq \text{Case 1}$. In general, all four cases have the relationship

 $$\text{Case 0} \subseteq \text{Case 1} \subseteq \text{Case 2} \subseteq \text{Case 3},$$

and we extend the transformation to all cases in order.

3.2.1 Transformation Generalization for Case 1

In the Markov reward process setting, a transition-based reward function is often naively simplified by Equation 1, which changes all moments of the distribution except the first. In other words, the naive simplification keeps only the expectation intact. In order to estimate the return distribution for Markov reward processes with a deterministic transition-based reward function as well, the state-transition transformation (Algorithm 1) proposed in (Ma and Yu, 2017) should be applied first. Algorithm 1 shows that the transformation also works in a finite-horizon setting with a salvage reward.
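The idea of such a transformation can be sketched as follows for an infinite-horizon Markov reward process: each reachable transition pair becomes a state of the new process, so the transition-based reward becomes state-based. This is one reading of the idea, not a reproduction of Algorithm 1; the function name, the list-based layout, and the example process are assumptions.

```python
def transform_mrp(mu, P, r_t):
    """Turn transition-based rewards into state-based ones on transition pairs.

    mu[x]: initial distribution, P[x][y]: transition probability,
    r_t[x][y]: deterministic reward received on the transition x -> y.
    Returns (states, mu_new, P_new, r_new) for the transformed process.
    """
    n = len(mu)
    states = [(x, y) for x in range(n) for y in range(n) if P[x][y] > 0]
    index = {s: i for i, s in enumerate(states)}
    mu_new = [mu[x] * P[x][y] for (x, y) in states]
    r_new = [r_t[x][y] for (x, y) in states]
    P_new = [[0.0] * len(states) for _ in states]
    for (x, y), i in index.items():
        for z in range(n):
            if P[y][z] > 0:
                # From pair (x, y) the chain moves to pair (y, z) with probability p(z | y).
                P_new[i][index[(y, z)]] = P[y][z]
    return states, mu_new, P_new, r_new


# Hypothetical two-state process.
mu = [1.0, 0.0]
P = [[0.5, 0.5], [0.0, 1.0]]
r_t = [[0.0, 10.0], [0.0, 2.0]]
states, mu_new, P_new, r_new = transform_mrp(mu, P, r_t)
```

Because every reward sequence of the original process appears with the same probability in the transformed one, the transformed process has a deterministic state-based reward without averaging anything away, at the cost of a larger state space.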

With the aid of Theorem 3.1 and Assumption 1, the return distribution for a Markov reward process with a deterministic transition-based reward function can be estimated after applying the transformation algorithm. For the inventory MDP with policy , Figure 2 compares the return distributions for the two MDPs (one from the naive reward simplification, and the other with the transformation) with the averaged empirical return distribution, whose error bars represent the standard deviation of the mean. The simulation is repeated 50 times with a time horizon 1000, and .
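An empirical return distribution of this kind can be obtained by Monte Carlo simulation. The sketch below samples truncated discounted returns of a Markov reward process with a transition-based reward; the process, the seed, the sample size, and the function names are hypothetical and do not reproduce the experiment in the paper.

```python
import random


def simulate_return(mu, P, r_t, gamma, horizon, rng):
    """Sample one truncated discounted return of a Markov reward process."""
    states = range(len(mu))
    x = rng.choices(states, weights=mu)[0]
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        y = rng.choices(states, weights=P[x])[0]
        total += discount * r_t[x][y]   # reward depends on the transition x -> y
        discount *= gamma
        x = y
    return total


def empirical_cdf(samples, t):
    """Fraction of sampled returns that do not exceed t."""
    return sum(s <= t for s in samples) / len(samples)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
mu = [1.0, 0.0]
P = [[0.5, 0.5], [0.0, 1.0]]
r_t = [[0.0, 10.0], [0.0, 2.0]]
samples = [simulate_return(mu, P, r_t, 0.9, 1000, rng) for _ in range(200)]
```

Truncating at a finite horizon is harmless here because the discounted tail shrinks geometrically.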

The Kolmogorov–Smirnov statistic (Durbin, 1973) is used to quantify the distribution difference (error). Denote the averaged empirical return distribution by , the estimated distribution for the transformed process by , and the estimated distribution for the process with the naive reward simplification by . For the case in Figure 2, , and . The results show that the naive reward simplification leads to a nontrivial estimation error, and that the normal distribution assumption is valid for this example.
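For reference, a one-sample Kolmogorov–Smirnov statistic can be computed as in the sketch below, which measures the largest gap between the empirical distribution of a sample and a reference distribution function. The paper compares an averaged empirical distribution with estimated normal distributions; this simplified version, its function names, and its sample values are assumptions for illustration.

```python
import math


def normal_cdf(t, mean, std):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((t - mean) / (std * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of samples and a reference CDF."""
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap


# Hypothetical samples compared against a standard normal distribution.
d = ks_statistic([-1.0, 0.0, 1.0], lambda t: normal_cdf(t, 0.0, 1.0))
```

Checking both sides of each jump matters because the supremum can be approached just before a sample point as well as at it.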

Now consider the VaR objective. The VaR function is obtained by enumerating all the (deterministic) policies. Figure 3 shows the two estimated VaR functions. Since the VaR function can also be regarded as a return distribution, we can still use to measure the error from the reward simplification, and in this case . Denote the optimal quantile for the MDP with the naive reward simplification by , then the error bound for the optimal quantile , which is nontrivial in a risk-sensitive problem.

Figure 2 shows that the distribution for the process with the reward simplification has a smaller variance. The reason can be explained intuitively by the analysis of variance (Scheffé, 1999). Take the deterministic transition-based reward function as an example. If the possible rewards for the same current state are considered as a group, the total variance includes the variance between groups and the variance within groups. When the reward function is naively simplified by Equation 1, the variance within groups is removed, so the total variance is smaller. The same happens for the other types of reward simplification.
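The decomposition behind this argument is the law of total variance. As a sketch for a single step of a Markov reward process, with $X$ the current state and $Y$ the next state (symbols chosen here for illustration):

```latex
\operatorname{Var}\bigl(r_{DT}(X,Y)\bigr)
  = \underbrace{\mathbb{E}\bigl[\operatorname{Var}\bigl(r_{DT}(X,Y)\mid X\bigr)\bigr]}_{\text{within groups}}
  + \underbrace{\operatorname{Var}\bigl(\mathbb{E}\bigl[r_{DT}(X,Y)\mid X\bigr]\bigr)}_{\text{between groups}}
```

After the naive simplification, the one-step reward is the conditional expectation given $X$, so only the between-groups term remains and the first term, which is nonnegative, is dropped.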

Remark 1 (Transformation for the process with a stochastic reward function).

In order to convert a Markov reward process with a stochastic transition-based reward function (or either of the other two types that are not deterministic and state-based) to one with a deterministic state-based reward function, and keep the return distribution intact at the same time, a bijective mapping between the state space and the possible “situation” space is needed. (Footnote 5: Here we consider the reward for a given transition pair as a discrete random variable only. When it is continuous, the transformed state space will be continuous as well.) For a Markov reward process with , a possible situation can be defined by a tuple , in which .

3.2.2 Transformation Generalization for Case 2

Given an MDP with a randomized policy , the reward function is often naively simplified as well. Taking a deterministic state-based reward function for example, the reward function is simplified to . There are two ways to deal with the randomized policy in order to keep the distribution intact. One is to treat it as a randomization of the reward, which means that will be converted to a stochastic function, whose value equals with the probability . The other way is to include the action in the situation mentioned in Remark 1. Both ways result in the same transformed Markov reward process.

Theorem 3.2 (Transformation concerning policy).

For a Markov decision process with stochastic and transition-based, and given a randomized policy , there exists a Markov reward process with deterministic and state-based, such that both processes have the same return distribution.

Here we generalize the transformation for Case 2 in Algorithm 2.

3.2.3 Transformation Generalization for Case 3

The two settings dealt with above are for policy evaluation. In this subsection we prove the transformation for MDPs, which enables policy search techniques (stochastic gradient descent, for example) in a risk-sensitive scenario.

Theorem 3.3 (Transformation for MDPs).

Given an MDP with stochastic and transition-based, there exists an MDP with deterministic and state-based, such that for any given policy (possibly randomized) for , there exists a corresponding policy for , such that both Markov reward processes have the same total reward distribution.

Proof.

The proof has two steps. Step 1 is to construct a second MDP and show that, for every possible sample path of the first MDP, there exists a corresponding sample path of the second MDP. Step 2 is to prove that the probability of any possible sample path in the first MDP equals the probability of its counterpart in the second MDP.

Step 1:

Define . For , define . Define . In order to remove the dependency of the initial state distribution on policy, define a null state space , and . Define the state space .

For all , define the state-based reward function , and ; define the transition kernel , and ; the initial state distribution .

Now we have two MDPs. Let , and . For any sample path in , there exists a sample path

 $$(s_{\mathrm{null},x_1}, a_1, 0, (x_1,a_1,j_1,x_2), a_2, j_1, (x_2,a_2,j_2,x_3), a_3, j_2, (x_3,a_3,j_3,x_4), \cdots)$$

in . Therefore, we proved that for every possible sample path for the first MDP, there exists a corresponding sample path for the second MDP.

Step 2:

Next we prove the probabilities for the two sample paths are equal. Here we prove it by mathematical induction. Set time epoch to after the first in , and after the first in .

Denote the partial sample path till epoch by in . Given any for , the probability for the sample path before epoch in is

 $$P\bigl(sp_1=(x_1,a_1,j_1,x_2)\bigr)=\mu(x_1)\,\pi(a_1\mid x_1)\,p(x_2\mid x_1,a_1)\,r(j_1\mid x_1,a_1,x_2).$$

There exists a policy for , with , and . The probability for the sample path before epoch in is

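 $$P\bigl(sp^{\dagger}_1=(s_{\mathrm{null},x_1},a_1,0,(x_1,a_1,j_1,x_2))\bigr)=\mu(s_{\mathrm{null},x_1})\,\pi^{\dagger}(a_1\mid s_{\mathrm{null},x_1})\,p\bigl((x_1,a_1,j_1,x_2)\mid s_{\mathrm{null},x_1},a_1\bigr)=P\bigl(sp_1=(x_1,a_1,j_1,x_2)\bigr).$$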

Therefore, at epoch , the two partial sample paths share the same probability.

Assuming that the two partial sample paths share the same probability at epoch , then the probability for the sample path before epoch in is

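 $$P\bigl(sp_{n+1}=(sp_n,x_{n+1},a_{n+1},j_{n+1},x_{n+2})\bigr)=P\bigl(sp_n=(x_1,\cdots,x_n,a_n,j_n,x_{n+1})\bigr)\times\pi(a_{n+1}\mid x_{n+1})\,p(x_{n+2}\mid x_{n+1},a_{n+1})\,r(j_{n+1}\mid x_{n+1},a_{n+1},x_{n+2}).$$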

The probability for the sample path before epoch in is

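 $$\begin{aligned}
 P\bigl(sp^{\dagger}_{n+1}=(sp^{\dagger}_n,a_{n+1},j_n,(x_{n+1},a_{n+1},j_{n+1},x_{n+2}))\bigr)
 &=P\bigl(sp^{\dagger}_n=(s_{\mathrm{null},x_1},\cdots,(x_n,a_n,j_n,x_{n+1}))\bigr)\\
 &\quad\times\pi^{\dagger}\bigl(a_{n+1}\mid(x_n,a_n,j_n,x_{n+1})\bigr)\\
 &\quad\times p\bigl((x_{n+1},a_{n+1},j_{n+1},x_{n+2})\mid(x_n,a_n,j_n,x_{n+1}),a_{n+1}\bigr)\\
 &\quad\times r\bigl(j_n\mid(x_n,a_n,j_n,x_{n+1}),a_{n+1}\bigr)\\
 &=P\bigl(sp_{n+1}=(sp_n,x_{n+1},a_{n+1},j_{n+1},x_{n+2})\bigr).
 \end{aligned}$$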

By induction, we have proved that the probability of any possible sample path in the first MDP equals the probability of its counterpart in the second MDP. Since a reward sequence is a subsequence of a sample path, each possible reward sequence has the same probability in the two MDPs as well. Therefore, the total reward distributions for the two MDPs are the same. Theorem 3.3 is proved.

Notice that Theorem 3.3 focuses on the total reward distribution instead of the return distribution. When the return distribution is considered, the reward function should be multiplied by to compensate for the time drift effect, which is brought by the null states.

4 Conclusion and Discussion

In this paper, we illustrated the effect of the naive reward simplification on the return distribution, and generalized the transformation for MDPs in different settings. By applying the transformation instead of simplifying the reward function in the naive way, MDPs with different types of reward functions and (or) randomized policies are converted to ones with deterministic and state-based reward functions, while keeping the return (total reward) distribution intact. The generalized transformation provides a platform for using the traditional value function and Q-function in risk-sensitive RL.

The transformation algorithm is suitable for stationary settings, where the MDP parameters do not vary with time. In a dynamic setting, algorithms (Dearden et al., 1998; Bellemare et al., 2017) based on the distributional Bellman equation (Morimura et al., 2010) can estimate the return distribution.

References

• Altman (1999) Altman, E. (1999). Constrained Markov Decision Processes. CRC Press.
• Artzner et al. (1998) Artzner, P., Delbaen, F., Eber, J., and Heath, D. (1998). Coherent measures of risk. Mathematical Finance, 9(3):1–24.
• Bellemare et al. (2017) Bellemare, M., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 449–458.
• Dearden et al. (1998) Dearden, R., Friedman, N., and Russell, S. (1998). Bayesian Q-learning. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI), pages 761–768.
• Durbin (1973) Durbin, J. (1973). Distribution Theory for Tests based on the Sample Distribution Function. SIAM.
• Filar et al. (1995) Filar, J. A., Krass, D., and Ross, K. W. (1995). Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10.
• Jones (2004) Jones, G. L. (2004). On the Markov chain central limit theorem. Probability Surveys, 1:299–320.
• Ma and Yu (2017) Ma, S. and Yu, J. Y. (2017). Transition-based versus state-based reward functions for MDPs with Value-at-Risk. In Proceedings of the 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 974–981.
• Mannor and Tsitsiklis (2011) Mannor, S. and Tsitsiklis, J. (2011). Mean-variance optimization in Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1–22.
• Meyn and Tweedie (2009) Meyn, S. P. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability. Springer Science & Business Media.
• Morimura et al. (2010) Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. (2010). Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 799–806.
• Nilim and Ghaoui (2005) Nilim, A. and Ghaoui, L. E. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.
• Prashanth et al. (2016) Prashanth, L. A., Jie, C., Fu, M., Marcus, S., and Szepesvári, C. (2016). Cumulative prospect theory meets reinforcement learning: Prediction and control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1406–1415.
• Puterman (1994) Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
• Ruszczyński and Shapiro (2006) Ruszczyński, A. and Shapiro, A. (2006). Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433–452.
• Scheffé (1999) Scheffé, H. (1999). The Analysis of Variance. John Wiley & Sons.
• Sobel (1982) Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802.
• Sobel (1994) Sobel, M. J. (1994). Mean-variance tradeoffs in an undiscounted MDP. Operations Research, 42(1):175–183.
• White (1988) White, D. J. (1988). Mean, variance, and probabilistic criteria in finite Markov decision processes: A review. Journal of Optimization Theory and Applications, 56(1):1–29.
• Woodroofe (1992) Woodroofe, M. (1992). A central limit theorem for functions of a Markov chain with applications to shifts. Stochastic Processes and their Applications, 41(1):33–44.