Effect of Reward Function Choices in MDPs with Value-at-Risk
Abstract
This paper studies Value-at-Risk (VaR) problems in short- and long-horizon Markov decision processes (MDPs) with finite state space and two different reward functions. First, we examine the effects of the two reward functions under two criteria in a short-horizon MDP. We show that under the VaR criterion, when the original reward function depends on both the current and next states, the reward simplification changes the VaR. Second, for long-horizon MDPs, we estimate the Pareto front of the total reward distribution set with the aid of spectral theory and the central limit theorem. Since the estimation applies only to a Markov process with the simplified reward function, we present a transformation algorithm for the Markov process with the original reward function, in order to estimate the Pareto front with an intact total reward distribution.
1 Introduction
A Markov decision process (MDP) is a mathematical framework for formulating discrete-time stochastic control problems. This framework has two features: one is randomness, mainly reflected by the transition probability; the other is controllability, reflected by the policy. These two features make the MDP a natural tool for sequential decision-making in practical problems.
The standard class of optimality criteria concerns the expected total reward, which carries the expectation information of the total reward cumulative distribution function (CDF) in several forms, such as the expected discounted total reward in a finite- or infinite-horizon MDP, the average reward in an infinite-horizon MDP, etc. [Derman, 1970, Puterman, 1994].
However, the expectation optimality criteria are not sufficient for many risk-averse problems, where the risk concerns arise not only mathematically but also psychologically. A classic example in psychology is the “St. Petersburg Paradox,” which refers to a lottery with an infinite expected reward for which people are nevertheless only willing to pay a small amount to play. This problem is thoroughly studied in utility theory, and a recent study brought this idea to reinforcement learning [Prashanth et al., 2015]. A more mathematical example is autonomous vehicles, in which a sufficient safety margin is more important than high expected performance. In general, when high reliability is concerned, the criterion should be formulated as a probability instead of an expectation.
Two risk criterion classes have been widely examined in recent years. One is the coherent risk measure [Artzner et al., 1998], which has a set of intuitively reasonable properties (convexity, for example). A thorough study of coherent risk optimization can be found in [Ruszczyński and Shapiro, 2006]. The other important class is the mean-variance measure [White, D. J., 1988, Sobel, 1994, Mannor and Tsitsiklis, 2011], in which the expected return is maximized for a given risk level (variance); it is also known as modern portfolio theory.
This paper studies value-at-risk (VaR), which originated in finance. For a given portfolio (an MDP with a policy), a loss threshold (target level), and a time horizon, VaR concerns the probability that the loss on the portfolio exceeds the threshold over the time horizon. VaR is hard to deal with since it is not a coherent risk measure [Riedel, 2004].
When the criterion concerns the whole distribution instead of the expectation only, the simplification of the reward function affects the optimal value. For example, the reward function $r(s_t, a_t)$ (SA-function) is widely used in many studies on MDPs, and it is adequate as long as the optimality criterion is presented as an expectation. However, when risk is involved and the original reward function is $r(s_t, a_t, s_{t+1})$ (SAS-function), the simplification will lead to a non-optimal policy.
In this paper, we study the VaR problems in short- and long-horizon MDPs with finite state and action spaces, as well as the effect of the two reward functions. In Section 3, we use a short-horizon MDP to illustrate that under the expected total reward criterion, MDPs with the two reward functions (SA- and SAS-functions) have the same optimal expectation/policy, but different total reward distributions, which result in different VaRs. We also compare the augmented-state 0-1 MDP method and the Pareto front generation for the VaR criteria.
Our main contributions are described and discussed in Section 4, which include the following:

We propose a state-transition transformation algorithm for Markov reward processes derived from MDPs with SAS-functions, in order to estimate the total reward distribution. Since the CDF estimation method is for a Markov process with a reward function $r(x_t)$, the proposed algorithm can transform a Markov process with a reward function $r(x_t, x_{t+1})$ into one suitable for the CDF estimation method, and keep the distribution intact.

We illustrate that both VaR criteria relate to the Pareto front of the total reward distribution set, and we estimate the Pareto front with the aid of spectral theory and the central limit theorem for long-horizon MDPs.
Besides, when the optimality criterion refers to the whole distribution instead of the expectation only, and the original reward function in the MDP is an SAS-function, the reward function should not be simplified. We believe that related studies which concern VaR or other risk-sensitive criteria should be revisited using our proposed transformation approach instead of the reward simplification.
Related work: This paper adopts the VaR criteria defined in [Filar et al., 1995], which studied the VaR problems on the average reward by separating the state space into communicating and transient classes. Bouakiz and Kebir [Bouakiz and Kebir, 1995] pointed out that the cumulative reward is needed for the VaR criteria, and various properties of the optimality equations were studied in both finite- and infinite-horizon MDPs. In a finite-horizon MDP, Wu and Lin [Wu and Lin, 1999] showed that the VaR optimal value functions are target distribution functions, and that there exists a deterministic optimal policy; the structural property of the optimal policy for an infinite-horizon MDP was also studied. Ohtsubo and Toyonaga [Ohtsubo and Toyonaga, 2002] gave two sufficient conditions for the existence of an optimal policy in infinite-horizon discounted MDPs, and another condition for the unique solution on a transient state set. For the VaR problem with a given percentile, Delage and Mannor [Delage and Mannor, 2010] solved it as a convex “second order cone” program with reward or transition uncertainty. Unlike most studies, Boda and Filar [Boda and Filar, 2006] and Kira et al. [Kira et al., 2012] considered the VaR criterion in a multi-epoch setting, in which a risk measure is required to reach an appropriate level not only at the final epoch but also at all intermediate epochs.
The VaR problem with a fixed threshold (target value) has been extensively studied. An augmented-state 0-1 MDP was proposed for finite-horizon MDPs with either integer or real-valued reward functions: the cumulative reward space is included in the state space, and the states which satisfy the threshold are “tagged” by a Boolean reward function; the general reward discretizing error was also bounded [Xu and Mannor, 2011]. In a similar problem named MaxProb MDP, the goal states (in which the threshold is satisfied) were defined as absorbing states, and the problem was solved in a similar way [Kolobov et al., 2011]. Value iteration (VI) was proposed to solve the MaxProb MDP [Yu et al., 1998], and was followed by some VI variants. In the topological value iteration (TVI) algorithm, states were separated into strongly-connected groups, and efficiency was improved by solving the state groups sequentially [Dai et al., 2011]. Two methods were presented to separate the states efficiently by integrating depth-first search (TVI-DFS) and dynamic programming (TVI-DP) [Hou et al., 2014]. For both exact and approximate algorithms for VaR with a threshold, the state of the art can be found in [Steinmetz et al., 2016].
Constrained probabilistic MDPs take VaR as a constraint. The mean-VaR portfolio optimization problem was solved with a Lagrange multiplier for the VaR constraint over a continuous time span [Yiu et al., 2004]. Bonami and Lejeune [Bonami and Lejeune, 2009] solved the mean-variance portfolio optimization problem, and used variants of Chebyshev’s inequality to derive convex approximations of the quantile function. Randour et al. [Randour et al., 2015] converted the total discounted reward criterion to an almost-sure percentile problem, and proposed an algorithm based on linear programming to solve the weighted multi-constraint percentile problem. It has also been pointed out that a randomized policy may be necessary when the VaR criterion is considered as a constraint; an example can be found in [Defourny et al., 2008].
2 Preliminaries and Notations
A finite-horizon MDP $\langle T, S, A_s, r, p, \mu, h \rangle$
is observed at decision epochs $t = 1, 2, \ldots, T$; $S$ is a finite state space, and $s_t \in S$ denotes the state at epoch $t$; $A_s$ is the legitimate action set associated with each state $s \in S$, $A = \bigcup_{s \in S} A_s$ is a finite action space, and $a_t \in A_{s_t}$ denotes the action at epoch $t$; $r$ is the bounded and measurable reward function, and $r(s_t, a_t, s_{t+1})$ denotes the reward (or cost if negative) given $s_t$, $s_{t+1}$, and the action $a_t$. This reward function has three arguments, and we name it the SAS-function; $p(s_{t+1} \mid s_t, a_t)$ denotes the homogeneous transition probability; $\mu$ is the initial state distribution; $h(s_{T+1})$ denotes the salvage function.
The optimal policy is determined by the optimality criterion. A policy $\pi$ refers to a sequence of decision rules $(d_1, d_2, \ldots, d_T)$. Different forms of decision rule are used in different situations, and here we focus on deterministic Markovian decision rules.
In a finite-horizon MDP under an expectation criterion [Puterman, 1994], the SAS-function is usually simplified by
$$r(s_t, a_t) = \sum_{s_{t+1} \in S} p(s_{t+1} \mid s_t, a_t)\, r(s_t, a_t, s_{t+1}). \qquad (1)$$
Here we name the reward function $r(s_t, a_t)$ the SA-function. It is suitable to simplify the reward function when the total reward expectation is considered, but when VaR is the criterion, the simplification will lead to a non-optimal result.
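The averaging in Equation (1) can be sketched numerically; the two-state chain and its rewards below are made-up illustrations, not the paper's example.

```python
def simplify_reward(r_sas, p):
    """Collapse an SAS-function into an SA-function by averaging over
    the next state: r(s, a) = sum_{s'} p(s' | s, a) * r(s, a, s')."""
    n_states = len(r_sas)
    r_sa = {}
    for s in range(n_states):
        for a in r_sas[s]:
            r_sa[(s, a)] = sum(p[s][a][s2] * r_sas[s][a][s2]
                               for s2 in range(n_states))
    return r_sa

# Two states, one action (labeled 0): from state 0 the process moves to
# state 0 or 1 with equal probability, earning reward 0 or 6 respectively.
r_sas = [{0: [0.0, 6.0]}, {0: [1.0, 1.0]}]
p     = [{0: [0.5, 0.5]}, {0: [0.5, 0.5]}]
r_sa = simplify_reward(r_sas, p)
# The SA-function keeps the mean (3.0) but forgets that the realized
# reward is either 0 or 6 -- exactly the information VaR needs.
print(r_sa[(0, 0)])  # 3.0
```

This makes the paper's point concrete: the expectation is preserved by the simplification, but the spread of the one-step reward is lost.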
2.1 Value-at-Risk Criteria
In this paper, we consider VaR instead of its risk-neutral counterparts. Two VaR problems are considered [Filar et al., 1995]. Denote $\Pi$ as the deterministic policy space with time horizon $T$. Given a policy $\pi \in \Pi$ and an initial distribution $\mu$, we have the total reward $\Phi_T = \sum_{t=1}^{T} r(s_t, a_t, s_{t+1}) + h(s_{T+1})$, where $a_t = d_t(s_t)$. To simplify the notation we henceforth denote the total reward by $\Phi$. Denote $F^\pi$ as the total reward CDF under a policy $\pi$. VaR addresses the following problems.
Problem 1.
Given a percentile $\alpha \in (0, 1)$, find $\sup_{\pi \in \Pi} (F^\pi)^{-1}(\alpha)$.
This problem refers to the quantile function, i.e., $(F^\pi)^{-1}(\alpha) = \inf\{x \in \mathbb{R} \mid F^\pi(x) \ge \alpha\}$.
Problem 2.
Given a threshold (target level) $\tau$, find $\sup_{\pi \in \Pi} \Pr(\Phi \ge \tau)$.
This problem concerns $\Pr(\Phi \ge \tau) = 1 - F^\pi(\tau^-)$. Both VaR problems relate to the Pareto front of the CDF set $\{F^\pi \mid \pi \in \Pi\}$, i.e., $F^*(x) = \inf_{\pi \in \Pi} F^\pi(x)$ for all $x \in \mathbb{R}$. As will be illustrated below, when the horizon is short (Section 3), any point along $F^*$ is an optimal threshold-percentile pair, and when the horizon is long (Section 4) and every (estimated) $F^\pi$ is strictly increasing, any point along $F^*$ is likewise optimal for both problems. Since there exists a deterministic optimal policy for finite-horizon MDPs under VaR criteria [Wu and Lin, 1999], we only consider the deterministic policy space.
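For a single policy with a finite total reward distribution, the two quantities behind Problems 1 and 2 can be computed directly; the sketch below uses a made-up three-point distribution, not one from the paper.

```python
def quantile(dist, alpha):
    """Smallest x with F(x) >= alpha, i.e. the quantile function
    (the per-policy objective of VaR Problem 1)."""
    acc = 0.0
    for x, px in sorted(dist.items()):
        acc += px
        if acc >= alpha:
            return x
    return max(dist)

def prob_at_least(dist, tau):
    """P(total reward >= tau), the per-policy objective of VaR Problem 2."""
    return sum(px for x, px in dist.items() if x >= tau)

# Hypothetical total reward law of one fixed policy.
dist = {0: 0.25, 4: 0.5, 10: 0.25}
print(quantile(dist, 0.5))     # 4
print(prob_at_least(dist, 5))  # 0.25
```

Optimizing either quantity over all policies, rather than evaluating it for one fixed policy, is what makes the VaR problems hard.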
The SA-function is commonly used in most MDP studies, even those considering risk ([Filar et al., 1995], for example), instead of the SAS-function. However, under the VaR criteria, if the original reward function is an SAS-function, the simplification misses the optimality, i.e., neither the policy nor the VaR is optimal. In an MDP with an SAS-function, the simplified SA-function leads to the same optimal policy under the expected total reward criteria, but different optimal policies under the VaR criteria. Here we use a short-horizon inventory control problem to illustrate the effect of the reward function on the optimal value under the two criteria, and what VaR is about.
3 Short-Horizon MDP for Inventory Problem
The inventory control problem is a straightforward example for illustrating the effect of the two reward functions, since the reward (sales volume) is related to both the current and next states. In this short-horizon MDP, we show that under the expected total reward criterion, the simplification (SA-function) of the original reward function (SAS-function) does not affect the optimal value/policy, but changes the total reward CDF.
3.1 MDP Description
This example is modified from [Puterman, 1994, Section 3.2], and the complete problem description can be found in Appendix A. Briefly, the MDP for the short-horizon inventory problem is as follows. The time horizon is short; the state set defines all possible inventory levels; the action sets define all legitimate actions (orders) for each state. The two reward functions and the transition probabilities are illustrated in Figure 1. The labels along transitions denote the original SAS-function and the transition probability, and the labels in the text boxes near states denote the simplified SA-function. (For example, the label below the transition from state 0 to state 1 gives the reward together with the transition probability 0.5, and the label in the text box near state 0 gives the corresponding simplified reward.) The bold parts are an example which illustrates the difference between the two reward functions. Besides, we set the initial state distribution $\mu$ and the salvage reward $h$ as in Appendix A. Now we have two MDPs with different reward functions: one with the SAS-function and one with the SA-function.
3.2 Expected Total Reward Criterion
Under the expected total reward criterion (nominal, discounted, or average), the SAS- and SA-functions lead to the same optimal results, but different total reward CDFs, which result in different VaRs. We illustrate this difference with a short-horizon MDP, and without loss of generality, we consider the nominal expected total reward criterion. The optimal policy $\pi^* = (d_1^*, \ldots, d_T^*)$ is the same for both MDPs, and so is the optimal expected total reward.
As shown in Figure 2, under the expected total reward criterion, the simplification of the SAS-function leads to a different total reward distribution. In the next section, we discuss the VaR criteria, which refer to the Pareto front of the CDF set; since the reward simplification changes the total reward distribution, it misses the optimal VaR.
3.3 VaR Criteria
Unlike the expected total reward criteria, VaR optimality is not time-consistent, so backward induction cannot be implemented directly. One method is the augmented-state 0-1 MDP [Xu and Mannor, 2011], which incorporates the cumulative reward space into the state space, brings in the threshold, and reorganizes the MDP components, in order to calculate the percentile in an expectation way.
3.3.1 Augmented-State 0-1 MDP Method
Since the cumulative reward information is needed for optimality [Bouakiz and Kebir, 1995], an augmented state space is adopted to keep track of it. For short-horizon MDPs under the VaR criterion with a given threshold, Xu and Mannor [Xu and Mannor, 2011] presented a state augmentation method to include the cumulative reward in the state space. Define $\Psi$ as the cumulative reward space, and $r_{\min}$ and $r_{\max}$ as the minimum and maximum of the one-step rewards. Then $\Psi$ can be set as $[T r_{\min}, T r_{\max}]$, or we can acquire $\Psi$ by enumerating all possible cumulative rewards within a short horizon. Define the augmented state space $\bar{S} = S \times \Psi$ for the new MDP. This state augmentation process is also used in several former studies [Bouakiz and Kebir, 1995, Wu and Lin, 1999, Ohtsubo and Toyonaga, 2002, Xu and Mannor, 2011].
For an MDP with the original reward function, set all reward values to zero and the salvage reward to $\bar{h}(s, \varphi) = \mathbb{1}\{\varphi \ge \tau\}$, where $\tau$ is the threshold (target level) and $\varphi$ is the cumulative reward at the final epoch. This 0-1 MDP enables backward induction to calculate the optimal percentile (defined in VaR Problem 2) as an expectation. Filar et al. [Filar et al., 1995] used the same “0-1” method for infinite-horizon MDPs under both VaR criteria. For all $(s, \varphi) \in \bar{S}$, define the action space $\bar{A}_{(s, \varphi)} = A_s$; define the transition kernel $\bar{p}((s', \varphi') \mid (s, \varphi), a) = p(s' \mid s, a)$, where $\varphi' = \varphi + r(s, a, s')$; and define the initial distribution $\bar{\mu}(s, 0) = \mu(s)$.
Now we have an augmented-state 0-1 MDP. Here we prove that the optimal expected total reward of the new MDP equals the solution to VaR Problem 2 in the original MDP, and then calculate it with backward induction.
Lemma 1.
For every finite-horizon MDP, there exists an augmented-state 0-1 MDP in which the optimal expected total reward equals the optimal VaR of the original MDP with a threshold $\tau$.
Proof.
Given an augmented-state 0-1 MDP, for all $(s, \varphi) \in \bar{S}$, implement the backward induction as follows. Step 1: Set $t = T + 1$ and
$$u_{T+1}(s, \varphi) = \bar{h}(s, \varphi) = \mathbb{1}\{\varphi \ge \tau\}.$$
Step 2: Set $t \leftarrow t - 1$, and compute $u_t$ by
$$u_t(s, \varphi) = \max_{a \in A_s} \sum_{s' \in S} p(s' \mid s, a)\, u_{t+1}(s', \varphi + r(s, a, s')).$$
Step 3: If $t = 1$, stop. Otherwise return to Step 2.
Since the only nonzero rewards are the salvage values $\mathbb{1}\{\varphi \ge \tau\}$, we have $u_t(s, \varphi) = \Pr(\Phi \ge \tau \mid s_t = s, \varphi_t = \varphi)$, i.e., the probability that the total reward reaches the threshold given any state at any epoch. The optimal policy derived from
$$d_t^*(s, \varphi) \in \arg\max_{a \in A_s} \sum_{s' \in S} p(s' \mid s, a)\, u_{t+1}(s', \varphi + r(s, a, s'))$$
gives the highest probability to reach the threshold. ∎
With the help of the new salvage reward function, we can implement backward induction to compute the corresponding percentile for the VaR criterion with a given threshold. The augmented-state 0-1 MDP method (Algorithm 1) is presented as follows.
In the implementation of the algorithm, it is worth noting that, in most instances, it is more efficient to deal with the state space in a time-dependent way, i.e., at each epoch, only a subspace of $\bar{S}$ is feasible.
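The backward induction on the augmented state space, with the time-dependent (layered) treatment just noted, can be sketched as follows; the toy instance (two made-up actions "safe" and "risky") is hypothetical and not from the paper.

```python
def var_backward_induction(T, states, actions, p, r, tau):
    """u[(t, s, phi)] = max over policies of P(total reward >= tau)
    from state s with cumulative reward phi at epoch t."""
    # Forward pass: collect the reachable (state, cumulative reward)
    # pairs at each epoch, so only a feasible subspace is stored.
    layers = [{(s, 0.0) for s in states}]
    for _ in range(T):
        nxt = set()
        for (s, phi) in layers[-1]:
            for a in actions[s]:
                for s2 in states:
                    if p.get((s, a, s2), 0.0) > 0:
                        nxt.add((s2, phi + r[(s, a, s2)]))
        layers.append(nxt)
    # Salvage epoch: reward 1 iff the threshold is reached.
    u = {(T, s, phi): (1.0 if phi >= tau else 0.0) for (s, phi) in layers[T]}
    # Backward induction on the zeroed-reward augmented MDP.
    for t in range(T - 1, -1, -1):
        for (s, phi) in layers[t]:
            u[(t, s, phi)] = max(
                sum(p.get((s, a, s2), 0.0)
                    * u[(t + 1, s2, phi + r[(s, a, s2)])]
                    for s2 in states if p.get((s, a, s2), 0.0) > 0)
                for a in actions[s])
    return u

# Toy instance (made up): "safe" earns 1 surely; "risky" earns 2 or 0
# with equal probability. With T = 2 and threshold tau = 2, playing
# safe reaches the threshold with probability 1.
states = ["lo", "hi"]
actions = {s: ["safe", "risky"] for s in states}
p, r = {}, {}
for s in states:
    p[(s, "safe", s)], r[(s, "safe", s)] = 1.0, 1.0
    p[(s, "risky", "hi")], r[(s, "risky", "hi")] = 0.5, 2.0
    p[(s, "risky", "lo")], r[(s, "risky", "lo")] = 0.5, 0.0
u = var_backward_induction(2, states, actions, p, r, 2.0)
print(u[(0, "lo", 0.0)])  # 1.0
```

Note how the expectation-optimal and VaR-optimal choices separate: both actions have expected reward 1 per step, yet only the safe action attains the threshold with probability 1.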
Now we use the augmented-state 0-1 MDP method to solve the inventory control problem described in Section 3.1. We consider the VaR criterion with a given threshold $\tau$. In the MDP with the SAS-function, backward induction yields the optimal policy and the optimal percentile. In the MDP with the (simplified) SA-function, both the optimal policy and the optimal percentile differ from those of the original MDP. The conclusion drawn in Section 3.1, which claims that the reward simplification changes the VaR, is verified here.
However, the augmented-state 0-1 MDP method is for VaR Problem 2 with a specified threshold only. In order to achieve all optimal VaRs with any $\alpha$ or $\tau$, we can enumerate all the deterministic policies on the augmented state space to acquire the Pareto front $F^*$.
Remark (Pareto front in a short horizon).
Given the Pareto front $F^*$ for a short-horizon MDP, for any threshold $\tau$, the optimal percentile in VaR Problem 2 is attained along $F^*$, since
$$\sup_{\pi \in \Pi} \Pr(\Phi \ge \tau) = 1 - \inf_{\pi \in \Pi} F^\pi(\tau^-) = 1 - F^*(\tau^-).$$
Figure 3 shows the Pareto fronts for the MDPs with the two reward functions. It illustrates that the simplification of the reward function changes the VaR (Pareto front). Given the two Pareto fronts, we can verify the solution to the VaR problem with a specified threshold $\tau$. Furthermore, for any threshold $\tau$, we can acquire the optimal percentile along the curves. (For example, at the specified threshold, the optimal percentile for the MDP with the SAS-function is 0.6875 (1 - 0.3125), while for the MDP with the SA-function it is 0.25 (1 - 0.75).) Table 1 shows the comparison between the two methods.
Augmented-state 0-1 MDP                          | Pareto front generation
-------------------------------------------------|----------------------------------------
Short horizon                                    | Long horizon
Exact result                                     | Estimated result for long-horizon MDPs
VaR Problem 2 only                               | Both VaR problems
Backward induction with cumulative reward space  | Enumerating stationary policies
In conclusion, the reward simplification changes the VaR for a short-horizon MDP. Under the VaR criterion with a threshold, the augmented-state 0-1 MDP method works well for enabling the backward induction algorithm. However, this method is intractable for long-horizon MDPs, and it works for VaR Problem 2 with a specified threshold only. Since both VaR problems relate to the Pareto front of the total reward CDF set, how to obtain the Pareto front efficiently needs further study.
4 VaR Criteria in Long-Horizon MDPs
Since it is intractable to find the exact optimal policy for a long-horizon MDP under the VaR criteria, we look for a deterministic stationary policy instead, i.e., $d_t = d$ for all $t$. With the aid of spectral theory and the central limit theorem, we can estimate the total reward CDF set for an MDP with the SA-function by enumerating all the deterministic stationary policies. In order to apply the method to MDPs with SAS-functions, we present an algorithm to transform a Markov process with the reward function $r(x_t, x_{t+1})$ into one with $r(z_t)$, where $z_t = (x_t, x_{t+1})$. This method is for both VaR problems.
4.1 Total Reward CDF Estimation
Firstly we estimate the CDF of a long-horizon Markov reward process derived from an MDP with the SA-function. Given an MDP (though the salvage reward is ignored when the horizon is long, it can be involved if necessary) and a deterministic stationary policy $d$, we have a Markov reward process. For $x_t \in S$, the reward is $r(x_t) = r(x_t, d(x_t))$, and the transition kernel is $p(x_{t+1} \mid x_t) = p(x_{t+1} \mid x_t, d(x_t))$.
Kontoyiannis and Meyn [Kontoyiannis and Meyn, 2003] proposed a method to estimate the CDF. In a positive recurrent Markov process with invariant probability measure (stationary distribution) $\varpi$, the total reward is $\Phi_n = \sum_{t=1}^{n} r(x_t)$, and the averaged reward is $\eta = \lim_{n \to \infty} \Phi_n / n$, which can be expressed as $\eta = \sum_{x} \varpi(x) r(x)$. Define $\hat{r}$ as the solution to the Poisson equation
$$\hat{r} - P \hat{r} = r - \eta,$$
where $P$ is the transition matrix. Two assumptions ([Kontoyiannis and Meyn, 2003], Section 4) are needed for the CDF estimation.
Assumption 1.
The Markov process is geometrically ergodic with a Lyapunov function $V: S \to [1, \infty)$.
Assumption 2.
The (measurable) centered reward function $\tilde{r} = r - \eta$ has zero mean and nontrivial asymptotic variance $\sigma^2 > 0$.
Under the two assumptions, we state the Edgeworth expansion theorem for non-lattice functionals (Theorem 5.1 in [Kontoyiannis and Meyn, 2003]) as follows.
Theorem 1.
Suppose that the Markov process and the strongly non-lattice functional $r$ satisfy Assumptions 1 and 2, and let $G_n$ denote the distribution function of the normalized partial sums:
$$G_n(y) = \Pr\!\left( \frac{\Phi_n - n\eta}{\sigma \sqrt{n}} \le y \right).$$
Then, for all $y \in \mathbb{R}$ and as $n \to \infty$,
$$G_n(y) = \Phi_{\mathcal{N}}(y) + \frac{\phi(y)}{\sqrt{n}}\, \varrho(y) + o\!\left(\frac{1}{\sqrt{n}}\right),$$
where $\phi$ denotes the standard normal density, $\Phi_{\mathcal{N}}$ is the corresponding distribution function, and $\varrho(y)$ is a correction term whose coefficients involve a constant related to the third moment of $\Phi_n$. The formulae for $\eta$, $\sigma$, and $\varrho$ can be found in Appendix B.
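The leading (Gaussian) term of this estimate can be sketched for a two-state reward chain. All numbers are made up; the closed-form Poisson-equation solution for two states and the variance formula $\sigma^2 = \varpi(\hat{r}^2 - (P\hat{r})^2)$ are our reconstruction, and the $1/\sqrt{n}$ Edgeworth correction is omitted.

```python
import math

def stationary(P):
    """Invariant distribution of a 2-state chain, in closed form."""
    a, b = P[0][1], P[1][0]          # leave-probabilities of each state
    return [b / (a + b), a / (a + b)]

def normal_cdf(y):
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def clt_cdf(P, r, n, t):
    """Approximate P(Phi_n <= t) by the CLT: Phi_n ~ N(n*eta, n*sigma^2)."""
    w = stationary(P)
    eta = w[0] * r[0] + w[1] * r[1]          # averaged reward
    rt = [r[0] - eta, r[1] - eta]            # centered reward r~
    # Poisson equation solution for a 2-state chain (assumed closed form):
    # r_hat = r~ / (P01 + P10), centered so that varpi(r_hat) = 0.
    ab = P[0][1] + P[1][0]
    r_hat = [rt[0] / ab, rt[1] / ab]
    Pr_hat = [P[0][0] * r_hat[0] + P[0][1] * r_hat[1],
              P[1][0] * r_hat[0] + P[1][1] * r_hat[1]]
    sigma2 = sum(w[i] * (r_hat[i] ** 2 - Pr_hat[i] ** 2) for i in range(2))
    return normal_cdf((t - n * eta) / math.sqrt(n * sigma2))

P = [[0.5, 0.5], [0.5, 0.5]]    # hypothetical 2-state chain
r = [0.0, 6.0]                  # hypothetical state rewards, eta = 3
print(clt_cdf(P, r, 100, 300))  # 0.5: 300 is exactly the mean 100 * eta
```

For this symmetric chain the rewards are i.i.d. on {0, 6}, so the computed $\sigma^2 = 9$ matches the per-step variance, a quick sanity check on the Poisson-equation route.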
4.2 State-Transition Transformation
For a Markov process derived from an MDP with an SAS-function and a stationary policy, we cannot apply the method directly since the reward function of the Markov process is $r(x_t, x_{t+1})$. If the reward function is simplified by Equation (1), the VaR is affected, as illustrated in Section 3. In order to implement the estimation, we propose a method to transform the Markov process into a Markov process with a single-argument reward function $r(z_t)$ which shares the same total reward distribution as the original Markov process.
Figure 4 illustrates the roles the states and transitions play in the original Markov process (above) and its transformed counterpart (below). In the original Markov process, $x$ denotes a state and $(x, x')$ denotes a transition from $x$ to $x'$. In the transformed Markov process, the original state becomes a “router,” which connects input nodes (transformed states of the form $(x', x)$) to output nodes (of the form $(x, x'')$).
Since we take state transitions as states, we name this algorithm the state-transition transformation. It is clear that the total reward distribution of the new Markov process is equivalent to that of the original one. In short, for MDPs with SAS-functions, each stationary policy leads to a Markov process with the reward function $r(x_t, x_{t+1})$, and in order to implement the CDF estimation without simplifying the reward function by Equation (1), we implement the state-transition transformation (Algorithm 2) to generate a Markov process with the same total reward distribution and a reward function $r(z_t)$.
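Under our reading of the construction, each positive-probability transition $(x, x')$ becomes a state $z = (x, x')$ carrying the reward $r(x, x')$, and $z = (x, x')$ moves to $z' = (x', x'')$ with the original probability $p(x'' \mid x')$. A sketch on a made-up two-state chain:

```python
def transform(p, r):
    """p[x][y]: transition matrix; r[(x, y)]: reward on transitions.
    Returns (new states z, new transition dict, new state rewards)."""
    n = len(p)
    zs = [(x, y) for x in range(n) for y in range(n) if p[x][y] > 0]
    pz = {}
    for (x, y) in zs:
        for (y2, y3) in zs:
            if y2 == y:                     # only chained transitions
                pz[((x, y), (y2, y3))] = p[y][y3]
    rz = {z: r[z] for z in zs}              # single-argument reward
    return zs, pz, rz

p = [[0.5, 0.5], [1.0, 0.0]]                  # hypothetical chain
r = {(0, 0): 1.0, (0, 1): 6.0, (1, 0): 0.0}   # reward on transitions
zs, pz, rz = transform(p, r)
# Each new state carries a one-argument reward, so the CDF estimation
# of Section 4.1 applies without simplifying r.
print(sorted(zs))            # [(0, 0), (0, 1), (1, 0)]
print(pz[((0, 1), (1, 0))])  # 1.0
```

Because each path of the new chain visits exactly the transitions of the corresponding path in the old chain, the sequence of collected rewards, and hence the total reward distribution, is unchanged.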
In the same MDP setup outlined in Section 3.1, we estimate the total reward CDFs over a long horizon. We set a long horizon and implement the state-transition transformation on the MDP with the SAS-function under each stationary policy. In Figure 5, we can see that the simplification of the SAS-function changes the VaR when the horizon is long, and the Pareto front with the SAS-function has a wider support than that with the SA-function.
Remark (Pareto front in a long horizon).
In a long-horizon MDP, given an estimated Pareto front $F^*$ which is strictly increasing, for any percentile $\alpha$, there exists a unique threshold $\tau$ such that $F^*(\tau) = \alpha$, and the pair $(\tau, \alpha)$ solves both VaR problems.
5 Conclusion and Discussions
In this paper, we studied short- and long-horizon MDPs with finite state space under VaR criteria, and the effect of the simplification of the reward function. In short-horizon MDPs, we first illustrated that when the original reward function is an SAS-function, the reward simplification does not affect the optimal value/policy under the expected total reward criterion, but changes the total reward CDF. Second, considering the VaR criteria, we solved VaR Problem 2 with the augmented-state 0-1 MDP method in an expectation way, and we enumerated all policies to obtain the Pareto front of the total reward CDF set. When the horizon is long, we estimated the CDF for every deterministic stationary policy in order to obtain the Pareto front. Since the estimation method is only for Markov processes derived from MDPs with the SA-function, we proposed a transformation algorithm to make it feasible for MDPs with the SAS-function.
The state-transition transformation enables the original transitions to have properties as states. Is there a similar transformation for MDPs, which can convert an MDP with the SAS-function to an MDP with the SA-function with an equivalent total reward distribution? Two components of the MDP deteriorate if the transformation is applied directly to MDPs: the initial state distribution will be determined by the decision rule at the first epoch, and the salvage reward will be determined by the decision rule at the final epoch. When we are concerned with long-run performance, and both effects can be ignored, we can implement a similar MDP transformation directly.
VaR concerns the threshold-percentile pair, and the optimality of one comes into conflict with the other, as they are virtually non-increasing functions of each other [Filar et al., 1995]. One future direction is to estimate the Pareto front without enumerating all the policies. A special case is that there exists a single optimal policy $\pi^*$, i.e., $F^{\pi^*} = F^*$. Ohtsubo and Toyonaga [Ohtsubo and Toyonaga, 2002] gave two sufficient conditions for the existence of this optimal policy in infinite-horizon discounted MDPs. Another idea is to consider it as a dual-objective optimization. Zheng [Zheng, 2009] studied the dual-objective MDP concerning variance and CVaR, which might provide some insight.
Under a VaR criterion, the simplification of the reward function affects the VaR. We believe that some practical problems with respect to VaR should be revisited using our proposed transformation approach when the reward function is an SAS-function.
Appendices
Appendix A Inventory Problem Description
Section 3.2.1 in [Puterman, 1994] described the model formulation and some assumptions for a single-product stochastic inventory control problem. Briefly, at epoch $t$, define $s_t$ as the inventory level before the order, $a_t$ as the order quantity, and $D_t$ as the demand, with a time-homogeneous probability distribution $p_d = \Pr(D_t = d)$; then we have $s_{t+1} = \max(s_t + a_t - D_t, 0)$.
For an order of $a$ units, define $c(a)$ as the (variable) cost to order $a$ units and $K$ as a fixed cost for placing orders; then we have the order cost $O(a) = (K + c(a))\,\mathbb{1}\{a > 0\}$. Let $f(j)$ denote the revenue when $j$ units of demand are fulfilled. Then we have the SAS-function $r(s_t, a_t, s_{t+1}) = f(s_t + a_t - s_{t+1}) - O(a_t)$. Here we ignore the maintenance fee to simplify the problem.
We set the parameters as follows: the time horizon, the fixed order cost, the variable order cost, the salvage reward, the warehouse capacity, the unit price, the demand probabilities, and the initial distribution, as shown in Figure 1. Firstly we calculate the SAS-function by the formula above. Secondly, we calculate the SA-function by Equation (1). Now we have two MDPs with different reward functions: one with the SAS-function and one with the SA-function.
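The two reward functions of the inventory model can be sketched as follows; the parameter values (fixed cost $K$, unit cost, price, demand law) are hypothetical placeholders, not the paper's values, which appear in Figure 1.

```python
# Assumed parameters (illustrative only).
K, c_unit, price = 4.0, 2.0, 8.0
demand_p = {0: 0.25, 1: 0.5, 2: 0.25}    # assumed demand distribution

def order_cost(a):
    """Fixed-plus-linear ordering cost O(a), charged only if a > 0."""
    return (K + c_unit * a) if a > 0 else 0.0

def r_sas(s, a, s_next):
    """SAS-function: revenue for fulfilled demand minus order cost.
    Fulfilled demand equals the inventory drawn down, s + a - s_next."""
    return price * (s + a - s_next) - order_cost(a)

def next_state(s, a, d):
    return max(s + a - d, 0)

def r_sa(s, a):
    """SA-function via Equation (1): expectation over the demand."""
    return sum(pd * r_sas(s, a, next_state(s, a, d))
               for d, pd in demand_p.items())

print(r_sas(0, 2, 0))  # sell 2 units: 8*2 - (4 + 2*2) = 8.0
print(r_sa(0, 2))      # 0.0: the demand-averaged reward
```

The sketch shows the same phenomenon as Section 3: $r(s, a)$ retains only the demand-averaged profit, while the realized profit ranges from a loss (no sales) to the full-sale revenue.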
Appendix B CDF Estimation for Long-Horizon Markov Reward Process
$\varrho$ is a constant related to the third moment of the partial sum $\Phi_n$. As described in [Kontoyiannis and Meyn, 2003], Section 5, and [Yu et al., 2015], $\varrho$ can be computed from the centered reward $\tilde{r} = r - \eta$, the solution $\hat{r}$ of the Poisson equation, and the stationary distribution $\varpi$.
As studied in [Glynn and Meyn, 1996] and [Yu et al., 2015], firstly define a kernel $\Pi(x, y) = \varpi(y)$ and the centered reward $\tilde{r} = r - \eta$. Then obtain the fundamental kernel $Z = (I - P + \Pi)^{-1}$ if it exists. Glynn and Meyn [Glynn and Meyn, 1996] showed that $\hat{r} = Z \tilde{r}$ solves the Poisson equation $\hat{r} - P\hat{r} = \tilde{r}$.
Theorem 17.4.4 in [Meyn and Tweedie, 2009] showed that when the chain is positive recurrent and $\varpi(\hat{r}^2) < \infty$, the asymptotic variance can be calculated by
$$\sigma^2 = \varpi\!\left(\hat{r}^2 - (P\hat{r})^2\right).$$
References
 [Artzner et al., 1998] Artzner, P., Delbaen, F., Eber, J., and Heath, D. (1998). Coherent measures of risk. Mathematical Finance, 9(3):1–24.
 [Boda and Filar, 2006] Boda, K. and Filar, J. A. (2006). Time Consistent Dynamic Risk Measures. Mathematical Methods of Operations Research, 63(1):169–186.
 [Bonami and Lejeune, 2009] Bonami, P. and Lejeune, M. A. (2009). An Exact Solution Approach for Portfolio Optimization Problems Under Stochastic and Integer Constraints. Operations Research, 57(3):650–670.
 [Bouakiz and Kebir, 1995] Bouakiz, M. and Kebir, Y. (1995). Targetlevel criterion in Markov decision processes. Journal of Optimization Theory and Applications, 86(1):1–15.
 [Dai et al., 2011] Dai, P., Weld, D. S., and Goldsmith, J. (2011). Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181–209.
 [Defourny et al., 2008] Defourny, B., Ernst, D., and Wehenkel, L. (2008). RiskAware Decision Making and Dynamic Programming. In Proceedings of NIPS08 Workshop on Model Uncertainty and Risk in Reinforcement Learning, pages 1–8.
 [Delage and Mannor, 2010] Delage, E. and Mannor, S. (2010). Percentile optimization for markov decision processes with parameter uncertainty. Operations research, 58(1):203–213.
 [Derman, 1970] Derman, C. (1970). Finite State Markovian Decision Processes. Academic Press, Inc.
 [Filar et al., 1995] Filar, J. A., Krass, D., and Ross, K. W. (1995). Percentile Performance Criteria for Limiting Average Markov Decision Processes. IEEE Transactions on Automatic Control, 40(1):2–10.
 [Glynn and Meyn, 1996] Glynn, P. W. and Meyn, S. P. (1996). A lyapunov bound for solutions of poisson’s equation. The Annals of Probability, pages 916–931.
 [Hou et al., 2014] Hou, P., Yeoh, W., and Varakantham, P. (2014). Revisiting risksensitive mdps: New algorithms and results. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), pages 136–144.
 [Kira et al., 2012] Kira, A., Ueno, T., and Fujita, T. (2012). Threshold probability of nonterminal type in finite horizon Markov decision processes. Journal of Mathematical Analysis and Applications, 386(1):461–472.
 [Kolobov et al., 2011] Kolobov, A., Mausam, Weld, D. S., and Geffner, H. (2011). Heuristic search for generalized stochastic shortest path mdps. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), pages 130–137.
 [Kontoyiannis and Meyn, 2003] Kontoyiannis, I. and Meyn, S. P. (2003). Spectral theory and limit theorems for geometrically ergodic markov processes. Annals of Applied Probability, 13:304–362.
 [Mannor and Tsitsiklis, 2011] Mannor, S. and Tsitsiklis, J. (2011). MeanVariance Optimization in Markov Decision Processes. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1–22.
 [Meyn and Tweedie, 2009] Meyn, S. P. and Tweedie, R. L. (2009). Markov chains and stochastic stability. Springer Science & Business Media.
 [Ohtsubo and Toyonaga, 2002] Ohtsubo, Y. and Toyonaga, K. (2002). Optimal policy for minimizing risk models in Markov decision processes. Journal of mathematical analysis and applications, 271(1):66–81.
 [Prashanth et al., 2015] Prashanth, L. A., Cheng, J., Fu, M., Marcus, S., and Jun, L. G. (2015). Cumulative Prospect Theory Meets Reinforcement Learning : Estimation and Control. Working Paper, pages 1–27.
 [Puterman, 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
 [Randour et al., 2015] Randour, M., Raskin, J., and Sankur, O. (2015). Percentile Queries in Multidimensional Markov Decision Processes. Computer Aided Verification, 9206:123–139.
 [Riedel, 2004] Riedel, F. (2004). Dynamic coherent risk measures. Stochastic Processes and their Applications, 112(2):185–200.
 [Ruszczyński and Shapiro, 2006] Ruszczyński, A. and Shapiro, A. (2006). Optimization of Convex Risk Functions. Mathematics of Operations Research, 31(3):433–452.
 [Sobel, 1994] Sobel, M. J. (1994). MeanVariance Tradeoffs in an Undiscounted MDP. Operations Research, 42(1):175–183.
 [Steinmetz et al., 2016] Steinmetz, M., Hoffmann, J., and Buffet, O. (2016). Goal probability analysis in mdp probabilistic planning: Exploring and enhancing the state of the art. Journal of Artificial Intelligence Research, 57:229–271.
 [White, D. J., 1988] White, D. J. (1988). Mean, Variance, and Probabilistic Criteria in Finite Markov Decision Processes: A Review. Journal of Optimization Theory and Applications, 56(1):1–29.
 [Wu and Lin, 1999] Wu, C. and Lin, Y. (1999). Minimizing Risk models in Markov decision process with policies depending on target values. Journal of Mathematical Analysis and Applications, 23(1):47–67.
 [Xu and Mannor, 2011] Xu, H. and Mannor, S. (2011). Probabilistic goal Markov decision processes. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2046–2052.
 [Yiu et al., 2004] Yiu, K. F. C., Wang, S. Y., and Mak, K. L. (2004). Optimal portfolios under a valueatrisk constraint. Journal of Economic Dynamics and Control, 28(7):1317–1334.
 [Yu et al., 2015] Yu, P., Yu, J. Y., and Xu, H. (2015). Centrallimit approach to riskaware markov decision processes. arXiv:1512.00583.
 [Yu et al., 1998] Yu, S. X., Lin, Y., and Yan, P. (1998). Optimization models for the first arrival target distribution function in discrete time. Journal of mathematical analysis and applications, 225(1):193–223.
 [Zheng, 2009] Zheng, H. (2009). Efficient frontier of utility and CVaR. Mathematical Methods of Operations Research, 70(1):129–148.