Effect of Reward Function Choices in MDPs with Value-at-Risk
This paper studies Value-at-Risk (VaR) problems in short- and long-horizon Markov decision processes (MDPs) with finite state space and two different reward functions. First, we examine the effects of the two reward functions under two criteria in a short-horizon MDP. We show that under the VaR criterion, when the original reward function depends on both the current and next states, the reward simplification changes the VaR. Second, for long-horizon MDPs, we estimate the Pareto front of the total reward distribution set with the aid of spectral theory and the central limit theorem. Since the estimation applies only to a Markov process with the simplified reward function, we present a transformation algorithm for the Markov process with the original reward function, which allows the Pareto front to be estimated with the total reward distribution intact.
A Markov decision process (MDP) is a mathematical framework for formulating discrete-time stochastic control problems. The framework has two features: randomness, mainly reflected in the transition probabilities, and controllability, reflected in the policy. These two features make the MDP a natural tool for sequential decision-making in practical problems.
The standard class of optimality criteria concerns the expected total reward, which summarizes the total reward cumulative distribution function (CDF) by its expectation, in several forms, such as the expected discounted total reward in a finite- or infinite-horizon MDP, the average reward in an infinite-horizon MDP, etc. [Derman, 1970, Puterman, 1994].
However, the expectation optimality criteria are not sufficient for many risk-averse problems, where the risk concerns arise not only mathematically but also psychologically. A classic example in psychology is the “St. Petersburg Paradox,” which refers to a lottery with an infinite expected reward, yet people are only willing to pay a small amount to play. This problem is thoroughly studied in utility theory, and a recent study brought the idea to reinforcement learning [Prashanth et al., 2015]. A more mathematical example is autonomous vehicles, in which a sufficient safety factor is more important than high expected performance. In general, when high reliability is required, the criterion should be formulated as a probability instead of an expectation.
Two classes of risk criteria have been widely examined in recent years. One is the coherent risk measure [Artzner et al., 1998], which satisfies a set of intuitively reasonable properties (convexity, for example). A thorough study of coherent risk optimization can be found in [Ruszczyński and Shapiro, 2006]. The other important class is the mean-variance measure [White, D. J., 1988, Sobel, 1994, Mannor and Tsitsiklis, 2011], in which the expected return is maximized for a given risk level (variance); it is also known as modern portfolio theory.
This paper studies value-at-risk (VaR), which originated from finance. For a given portfolio (an MDP with a policy), a loss threshold (target level), and a time-horizon, VaR concerns the probability that the loss on the portfolio exceeds the threshold over the time horizon. VaR is hard to deal with since it is not a coherent risk measure [Riedel, 2004].
When the criterion concerns the whole distribution instead of the expectation only, simplifying the reward function affects the optimal value. For example, a reward function of the form r(s, a) (SA-function) is used in many studies on MDPs, which is fine as long as the optimality criterion is an expectation. However, when risk is involved and the original reward function has the form r(s, a, s') (SAS-function), the simplification leads to a non-optimal policy.
In this paper, we study the VaR problems in short- and long-horizon MDPs with finite state and action spaces, as well as the effect of the two reward functions. In Section 3, we use a short-horizon MDP to illustrate that under the expected total reward criterion, MDPs with the two reward functions (SA- and SAS-functions) have the same optimal expectation/policy but different total reward distributions, which result in different VaRs. We also compare the augmented-state 0-1 MDP method and the Pareto front generation for the VaR criteria.
Our main contributions are described and discussed in Section 4, which include the following:
We propose a state-transition transformation algorithm for Markov reward processes derived from MDPs with SAS-functions, in order to estimate the total reward distribution. Since the CDF estimation method applies only to a Markov process whose reward depends on the current state alone, the proposed algorithm transforms a Markov process whose reward depends on both the current and next states into one suitable for the CDF estimation method, while keeping the total reward distribution intact.
We illustrate that both VaR criteria relate to the Pareto front of the total reward distribution set, and we estimate the Pareto front with the aid of spectral theory and the central limit theorem for long-horizon MDPs.
Besides, when the optimality criterion refers to the whole distribution instead of the expectation only, and the original reward function in the MDP is an SAS-function, the reward function should not be simplified. We believe that related studies concerning VaR or other risk-sensitive criteria should be revisited using our proposed transformation approach instead of the reward simplification.
Related work: This paper adopts the VaR criteria defined in [Filar et al., 1995], which studied the VaR problems on the average reward by separating the state space into communicating and transient classes. Bouakiz and Kebir [Bouakiz and Kebir, 1995] pointed out that the cumulative reward is needed for the VaR criteria, and studied various properties of the optimality equations in both finite- and infinite-horizon MDPs. In a finite-horizon MDP, Wu and Lin [Wu and Lin, 1999] showed that the VaR optimal value functions are target distribution functions and that there exists a deterministic optimal policy; the structural property of the optimal policy for an infinite-horizon MDP was also studied. Ohtsubo and Toyonaga [Ohtsubo and Toyonaga, 2002] gave two sufficient conditions for the existence of an optimal policy in infinite-horizon discounted MDPs, and another condition for the unique solution on a transient state set. For the VaR problem with a given percentile, Delage and Mannor [Delage and Mannor, 2010] solved it as a convex “second order cone” program with reward or transition uncertainty. Different from most studies, Boda and Filar [Boda and Filar, 2006] and Kira et al. [Kira et al., 2012] considered the VaR criterion in a multi-epoch setting, in which a risk measure is required to reach an appropriate level not only at the final epoch but also at all intermediate epochs.
The VaR problem with a fixed threshold (target value) has been extensively studied. An augmented-state 0-1 MDP was proposed for finite-horizon MDPs with either integer or real-valued reward functions: the cumulative reward space is folded into the state space, and the states that satisfy the threshold are “tagged” by a Boolean reward function; the general reward-discretization error was also bounded [Xu and Mannor, 2011]. In a similar problem named the MaxProb MDP, the goal states (in which the threshold is satisfied) were defined as absorbing states, and the problem was solved in a similar way [Kolobov et al., 2011]. Value iteration (VI) was proposed to solve the MaxProb MDP [Yu et al., 1998], and was followed by several VI variants. In the topological value iteration (TVI) algorithm, states were separated into strongly-connected groups, and efficiency was improved by solving the state groups sequentially [Dai et al., 2011]. Two methods were presented to separate the states efficiently by integrating depth-first search (TVI-DFS) and dynamic programming (TVI-DP) [Hou et al., 2014]. For both exact and approximate algorithms for VaR with a threshold, the state of the art can be found in [Steinmetz et al., 2016].
Constrained probabilistic MDPs take VaR as a constraint. The mean-VaR portfolio optimization problem was solved with a Lagrange multiplier for the VaR constraint over a continuous time span [Yiu et al., 2004]. Bonami and Lejeune [Bonami and Lejeune, 2009] solved the mean-variance portfolio optimization problem, and used variants of Chebychev’s inequality to derive convex approximations of the quantile function. Randour et al. [Randour et al., 2015] converted the total discounted reward criterion to an almost-sure percentile problem, and proposed an algorithm based on linear programming to solve the weighted multi-constraint percentile problem. It has also been pointed out that a randomized policy is necessary when the VaR criterion is considered as a constraint; an example can be found in [Defourny et al., 2008].
2 Preliminaries and Notations
A finite-horizon MDP is observed at decision epochs t = 1, …, T. S is a finite state space, and s_t ∈ S denotes the state at epoch t. A(s) is the legitimate action set associated with each state s, A is the finite action space, and a_t ∈ A(s_t) denotes the action at epoch t. r is the bounded and measurable reward function, and r(s_t, a_t, s_{t+1}) denotes the reward (or cost, if negative) received at epoch t, given the states s_t and s_{t+1} and the action a_t; this reward function has three arguments, and we name it the SAS-function. p(s_{t+1} | s_t, a_t) denotes the homogeneous transition probability, μ is the initial state distribution, and h denotes the salvage function at the final epoch.
The optimal policy is determined by the optimality criterion. A policy refers to a sequence of decision rules, one for each decision epoch. Different forms of decision rules are used in different situations; here we focus on deterministic Markovian decision rules.
In a finite-horizon MDP under an expectation criterion [Puterman, 1994], the SAS-function is usually simplified by

r(s, a) = Σ_{s' ∈ S} p(s' | s, a) r(s, a, s').   (1)

We name this simplified reward function the SA-function. The simplification is harmless when the expected total reward is considered, but when VaR is the criterion, it leads to a non-optimal result.
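As a concrete sketch, the simplification in Equation (1) is a probability-weighted average over next states. The two-state, two-action MDP below is hypothetical; only the averaging step mirrors the equation.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# r_sas[s, a, s2] is the SAS-function; p[s, a, s2] the transition probability.
r_sas = np.array([[[0.0, 4.0], [1.0, 2.0]],
                  [[3.0, 0.0], [2.0, 5.0]]])
p = np.array([[[0.5, 0.5], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])

# Equation (1): r(s, a) = sum_{s'} p(s' | s, a) * r(s, a, s')
r_sa = np.einsum("sax,sax->sa", p, r_sas)
```

The SA-function preserves the one-step expected reward, which is why expectation criteria are unaffected; what is lost is the information about which next state produced the reward, and that is exactly what distributional criteria need.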
2.1 Value-at-Risk Criteria
In this paper, we consider VaR instead of its risk-neutral counterparts. Two VaR problems are considered [Filar et al., 1995]. Denote by Π the deterministic policy space with time horizon T. Given a policy π ∈ Π and an initial distribution μ, the total reward is the sum of the rewards over the horizon plus the salvage reward; to simplify the notation we henceforth denote the total reward by R. Denote by F_π the total reward CDF under policy π. VaR addresses the following problems.
VaR Problem 1. Given a percentile α, find the largest threshold x such that some policy attains total reward x or more with probability at least α.
This problem refers to the quantile function of the total reward distribution.
VaR Problem 2. Given a threshold (target level) τ, find the largest probability of attaining total reward τ or more over all policies.
This problem concerns the tail probability 1 − F_π(τ). Both VaR problems relate to the Pareto front of the CDF set {F_π : π ∈ Π}, i.e., the CDFs that are nowhere dominated by another member of the set. As will be illustrated below, when the horizon is short (Section 3), any point along the front is attained by some deterministic policy, and when the horizon is long (Section 4) and every (estimated) CDF is strictly increasing, any point along the front answers both VaR problems through the quantile function. Since there exists a deterministic optimal policy for finite-horizon MDPs under VaR criteria [Wu and Lin, 1999], we only consider the deterministic policy space.
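For a single fixed policy, both problems can be read off the total reward distribution. The sketch below uses simulated total rewards from a hypothetical policy (a normal distribution stands in for the true total reward law):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated total rewards under one fixed policy (illustrative stand-in).
R = rng.normal(loc=10.0, scale=3.0, size=100_000)

# Problem 1: given a percentile alpha, the largest x with P(R >= x) >= alpha
# is the (1 - alpha)-quantile of R.
alpha = 0.95
x_alpha = np.quantile(R, 1.0 - alpha)

# Problem 2: given a threshold tau, the achieved probability P(R >= tau).
tau = 8.0
p_tau = np.mean(R >= tau)
```

Optimizing either quantity over all policies is what traces out the Pareto front of the CDF set.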
The SA-function is commonly used in most MDP studies, even those considering risk ([Filar et al., 1995], for example), instead of the SAS-function. However, under the VaR criteria, if the original reward function is an SAS-function, the simplification misses the optimality: neither the policy nor the VaR is optimal. In an MDP with an SAS-function, the simplified SA-function leads to the same optimal policy under the expected total reward criteria, but to different optimal policies under the VaR criteria. Here we use a short-horizon inventory control problem to illustrate the effect of the reward function on the optimal value under the two criteria, and what the VaR criteria capture.
3 Short-Horizon MDP for Inventory Problem
The inventory control problem is a straightforward example for illustrating the effect of the two reward functions, since the reward (sales volume) is related to both the current and next states. In this short-horizon MDP, we show that under the expected total reward criterion, the simplification (SA-function) of the original reward function (SAS-function) does not affect the optimal value/policy, but changes the total reward CDF.
3.1 MDP Description
This example is modified from ([Puterman, 1994], Section 3.2), and the complete problem description can be found in Appendix A. Briefly, the MDP for the short-horizon inventory problem is as follows. The time horizon is short; the state set contains all possible inventory levels; the action sets define the legitimate orders for each state. The two reward functions and the transition probabilities are illustrated in Figure 1: the labels along transitions denote the original SAS-function and the transition probability, and the labels in the text boxes near states denote the simplified SA-function. (For example, the label below the transition from 0 to 1 gives the reward of that transition and its probability 0.5; the label in the text box near state 0 gives the corresponding simplified reward.) The bold parts are an example illustrating the difference between the two reward functions. Besides, we fix the initial state distribution and the salvage reward for all states. Now we have two MDPs that differ only in their reward functions: one with the SAS-function and one with the SA-function.
3.2 Expected Total Reward Criterion
Under the expected total reward criterion (nominal, discounted, or average), the SAS- and SA-functions lead to the same optimal results but different total reward CDFs, which result in different VaRs. We illustrate this difference with a short-horizon MDP; without loss of generality, we consider the nominal expected total reward criterion. The two MDPs share the same optimal policy and the same optimal expected total reward.
As shown in Figure 2, under the expected total reward criterion, the simplification of the SAS-function leads to a different total reward distribution. In the next section, we discuss the VaR criteria, which refer to the Pareto front of the CDF set; since the reward simplification changes the total reward distribution, it misses the optimal VaR.
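The effect can be checked by brute force on a toy chain. The example below is hypothetical (two states, a single action, horizon 2, uniform initial distribution): the two reward functions give identical expected total reward but different total reward distributions.

```python
import itertools
import numpy as np

p = np.array([[0.5, 0.5], [0.5, 0.5]])       # transition probabilities
r_sas = np.array([[0.0, 2.0], [1.0, 3.0]])   # reward r(s, s')
r_sa = (p * r_sas).sum(axis=1)               # simplified reward r(s), Eq. (1)

def total_reward_dist(reward):
    """Exact distribution of the 2-step total reward by path enumeration."""
    dist = {}
    for s0, s1, s2 in itertools.product([0, 1], repeat=3):
        prob = 0.5 * p[s0, s1] * p[s1, s2]   # uniform initial distribution
        total = reward(s0, s1) + reward(s1, s2)
        dist[total] = dist.get(total, 0.0) + prob
    return dist

d_sas = total_reward_dist(lambda s, s2: r_sas[s, s2])
d_sa = total_reward_dist(lambda s, s2: r_sa[s])
mean_sas = sum(k * v for k, v in d_sas.items())
mean_sa = sum(k * v for k, v in d_sa.items())
```

Here mean_sas equals mean_sa, yet the supports of the two distributions differ, so any quantile-based criterion can disagree between them.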
3.3 VaR Criteria
Unlike the expected total reward criteria, the VaR optimality is not time-consistent, so backward induction cannot be applied directly. One method is the augmented-state 0-1 MDP [Xu and Mannor, 2011], which incorporates the cumulative reward space into the state space, brings in the threshold, and reorganizes the MDP components in order to calculate the percentile as an expectation.
3.3.1 Augmented-State 0-1 MDP Method
Since the cumulative reward information is needed for the optimality [Bouakiz and Kebir, 1995], an augmented state space is adopted to keep track of it. For short-horizon MDPs under the VaR criterion with a given threshold, Xu and Mannor [Xu and Mannor, 2011] presented a state augmentation method to include the cumulative reward in the state space. Define W as the cumulative reward space. Bounding the one-step rewards from below and above, W can be set as an interval, or we can acquire W by enumerating all possible cumulative rewards within a short horizon. The augmented state space for the new MDP is the product of the original state space and W. This state augmentation process is also used in several earlier studies [Bouakiz and Kebir, 1995, Wu and Lin, 1999, Ohtsubo and Toyonaga, 2002, Xu and Mannor, 2011].
For an MDP with the augmented state space, set all reward values to zero and define the salvage reward to be 1 if the cumulative reward at the final epoch meets the threshold (target level) τ and 0 otherwise. This 0-1 MDP enables backward induction to calculate the probability in VaR Problem 2 as an expectation. Filar et al. [Filar et al., 1995] used the same “0-1” method for infinite-horizon MDPs under both VaR criteria. The action sets, the transition kernel, and the initial distribution are augmented accordingly.
Now we have an augmented-state 0-1 MDP. Here we prove that the optimal expected total reward of the new MDP equals the solution to VaR Problem 2 in the original MDP, and then calculate it by backward induction.
For every finite-horizon MDP, there exists an augmented-state 0-1 MDP in which the optimal expected total reward equals the optimal VaR of the original MDP with a threshold τ.
Given an augmented-state 0-1 MDP, implement the backward induction as follows. Step 1: Set t = T and initialize the terminal value function with the 0-1 salvage reward. Step 2: Set t ← t − 1, and compute the value function at epoch t by maximizing, over the legitimate actions, the expected value at epoch t + 1; each value is therefore a probability. Step 3: If t = 1, stop; otherwise return to Step 2.
Since the only rewards are the 0-1 salvage rewards, the value function equals the probability that the total reward reaches the threshold, given the augmented state at the current epoch. The optimal policy, obtained by taking the maximizing action at each augmented state, gives the highest probability of reaching the threshold. ∎
With the help of the new salvage reward function, we can implement backward induction to compute the corresponding percentile for the VaR criterion with a given threshold. The augmented-state 0-1 MDP method (Algorithm 1) is presented as follows.
In implementing the algorithm, it is worth noting that, in most instances, it is more efficient to treat the state space in a time-dependent way, i.e., at each epoch, only a subspace of the augmented state space is feasible.
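A minimal sketch of this backward induction, assuming integer rewards and using a dictionary cache so that only the feasible (s, w) pairs at each epoch are ever visited; the chain at the bottom is hypothetical:

```python
def var_backward_induction(S, A, p, r, T, tau, mu):
    """Optimal P(total reward >= tau): backward induction on the augmented
    state (s, w), where w is the reward accumulated so far."""
    cache = {}

    def value(t, s, w):
        if t == T:                       # 0-1 salvage reward of the new MDP
            return 1.0 if w >= tau else 0.0
        key = (t, s, w)
        if key not in cache:
            cache[key] = max(
                sum(p[s][a][s2] * value(t + 1, s2, w + r[s][a][s2])
                    for s2 in p[s][a])
                for a in A[s]
            )
        return cache[key]

    return sum(mu[s] * value(0, s, 0) for s in S)

# Hypothetical 2-state chain: from state 0 the single action earns reward 1
# and stays in 0 w.p. 0.5, otherwise it moves to the absorbing state 1.
S = [0, 1]
A = {0: [0], 1: [0]}
p = {0: {0: {0: 0.5, 1: 0.5}}, 1: {0: {1: 1.0}}}
r = {0: {0: {0: 1, 1: 0}}, 1: {0: {1: 0}}}
prob = var_backward_induction(S, A, p, r, T=2, tau=2, mu={0: 1.0, 1: 0.0})
# prob = P(two consecutive stays in state 0) = 0.25
```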
Now we use the augmented-state 0-1 MDP method to solve the inventory control problem described in Section 3.1, under the VaR criterion with a given threshold. The backward induction yields an optimal policy and an optimal percentile for the MDP with the SAS-function, and a different optimal policy and a different optimal percentile for the MDP with the simplified SA-function. This verifies the conclusion drawn in Section 3.1 that the reward simplification changes the VaR.
However, the augmented-state 0-1 MDP method addresses VaR Problem 2 with a specified threshold only. In order to obtain the optimal VaR for any percentile or threshold, we can enumerate all the deterministic policies on the augmented state space to acquire the Pareto front.
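For both VaR problems, what matters is the lower envelope of the CDF set: at each threshold x, the best achievable probability is 1 minus the smallest CDF value among all policies. A sketch with empirical CDFs for three hypothetical policies (simulated total rewards stand in for the exact distributions):

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(-5.0, 25.0, 301)

# Empirical total-reward samples for three hypothetical policies.
samples = [rng.normal(8.0, 2.0, 50_000),
           rng.normal(10.0, 4.0, 50_000),
           rng.normal(6.0, 1.0, 50_000)]

# Empirical CDFs evaluated on the grid.
cdfs = np.array([np.searchsorted(np.sort(s), grid, side="right") / len(s)
                 for s in samples])

# Lower envelope: the best P(R >= x) over policies is 1 - min_pi F_pi(x).
envelope = cdfs.min(axis=0)
best_prob = 1.0 - envelope
```

With finitely many deterministic policies, the envelope is piecewise composed of individual policy CDFs; this is the kind of curve a Pareto-front plot such as Figure 3 conveys.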
Remark (Pareto front in a short horizon).
Given the Pareto front for a short-horizon MDP, any percentile-threshold pair on the front is achieved by some deterministic policy, since the front is composed of segments of the finitely many policy CDFs.
Figure 3 shows the Pareto fronts for the MDPs with the two reward functions. It illustrates that the simplification of the reward function changes the VaR (Pareto front). Given the two Pareto fronts, we can verify the solution to the VaR problem with a specified threshold. Furthermore, for any threshold, we can read the optimal probability off the curves. For example, at one threshold the optimal probability for the MDP with the SAS-function is 0.6875 (1 − 0.3125), while for the MDP with the SA-function it is 0.25 (1 − 0.75). Table 1 shows the comparison between the two methods.
| Augmented-state 0-1 MDP | Pareto front generation |
| --- | --- |
| Short horizon | Long horizon |
| Exact result | Estimated result for long-horizon MDPs |
| VaR Problem 2 only | Both VaR problems |
| Backward induction with cumulative reward space | Enumerating stationary policies |
In conclusion, the reward simplification changes the VaR for a short-horizon MDP. Under the VaR criterion with a threshold, the augmented-state 0-1 MDP method works well, enabling the backward induction algorithm. However, this method works for VaR Problem 2 with a specified threshold only, and it does not scale to long-horizon MDPs. Since both VaR problems relate to the Pareto front of the total reward CDF set, how to obtain the Pareto front when the horizon is long needs further study.
4 VaR Criteria in Long-Horizon MDPs
Since it is intractable to find the exact optimal policy for a long-horizon MDP under the VaR criteria, we look for a deterministic stationary policy instead. With the aid of spectral theory and the central limit theorem, we can estimate the total reward CDF set for an MDP with an SA-function by enumerating all the deterministic stationary policies. In order to apply the method to MDPs with SAS-functions, we present an algorithm that transforms a Markov process whose reward depends on both the current and next states into one whose reward depends on the current state only. This method serves both VaR problems.
4.1 Total Reward CDF Estimation
First we estimate the CDF of a long-horizon Markov reward process derived from an MDP with an SA-function. Given such an MDP and a deterministic stationary policy, we obtain a Markov reward process whose reward and transition kernel are induced by the policy. (Though the salvage reward is ignored when the horizon is long, it can be included if necessary.)
Kontoyiannis and Meyn [Kontoyiannis and Meyn, 2003] proposed a method to estimate the total reward CDF. In a positive recurrent Markov process with invariant probability measure (stationary distribution) π, the total reward is the partial sum of the rewards along the path, and the average reward is η = π(r), the stationary expectation of the reward. Define the limit function f̂, which solves the Poisson equation

f̂ − P f̂ = r − η,

where P is the transition matrix. Two assumptions ([Kontoyiannis and Meyn, 2003], Section 4) are needed for the CDF estimation.
Assumption 1. The Markov process is geometrically ergodic with a Lyapunov function that dominates the reward function.
Assumption 2. The (measurable) centered reward function has zero mean and nontrivial asymptotic variance σ² > 0.
Under the two assumptions, we state the Edgeworth expansion theorem for nonlattice functionals (Theorem 5.1 in [Kontoyiannis and Meyn, 2003]) as follows. Suppose that the Markov process and the strongly nonlattice reward functional satisfy Assumptions 1 and 2, and let G_n denote the distribution function of the normalized partial sums S_n:

G_n(y) = P( (S_n − nη) / (σ√n) ≤ y ).

Then, for all y and as n → ∞,

G_n(y) = Φ(y) + φ(y) ρ(y) / √n + o(1/√n),

where φ denotes the standard normal density, Φ is the corresponding distribution function, and ρ(y) is a polynomial correction whose coefficient is a constant related to the third moment of the rewards. The formulae for σ, η, and the correction can be found in Appendix B.
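Dropping the correction term gives the first-order normal approximation of the total reward CDF, which already conveys the idea. The sketch below computes the average reward and the asymptotic variance via the Poisson equation for a hypothetical two-state chain; the variance formula used is the one recalled in Appendix B.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):                       # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical 2-state Markov reward process.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()
eta = pi @ r                                        # average reward

# Solve the Poisson equation f - P f = r - eta (fundamental-matrix form).
f = np.linalg.solve(np.eye(2) - P + np.outer(np.ones(2), pi), r - eta)
sigma2 = pi @ (f**2 - (P @ f)**2)                   # asymptotic variance

# First-order CLT estimate of the total-reward CDF at horizon n:
# P(S_n <= y) ~ Phi((y - n * eta) / sqrt(sigma2 * n)).
def cdf_estimate(y, n):
    return Phi((y - n * eta) / np.sqrt(sigma2 * n))
```

The Edgeworth correction of the theorem refines this normal estimate by a term of order 1/√n.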
4.2 State-Transition Transformation
For a Markov process derived from an MDP with an SAS-function and a stationary policy, we cannot apply the method directly, since the reward function of the Markov process depends on both the current and next states. If the reward function is simplified by Equation (1), the VaR is affected, as illustrated in Section 3. In order to implement the estimation, we propose a method to transform the Markov process into one with a state-only reward function that shares the same total reward distribution as the original Markov process.
Figure 4 illustrates the roles that states and transitions play in the original Markov process (above) and its transformed counterpart (below). In the original Markov process, a node denotes a state and an edge denotes a transition between states. In the transformed Markov process, each original state becomes a “router” that connects input nodes (transformed states, one per incoming transition) to output nodes (one per outgoing transition).
Since we take state transitions as states, we name this algorithm the state-transition transformation. It is clear that the total reward distribution of the new Markov process is equivalent to that of the original one. In short, for MDPs with SAS-functions, each stationary policy leads to a Markov process whose reward depends on both the current and next states; in order to implement the CDF estimation without simplifying the reward function by Equation (1), we apply the state-transition transformation (Algorithm 2) to generate a Markov process with the same total reward distribution and a state-only reward function.
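Algorithm 2 is not reproduced here, so the following is a sketch of the state-transition idea; the dense-matrix representation, the function name, and the example chain are assumptions of this sketch.

```python
import numpy as np

def state_transition_transform(P, r_sas):
    """Turn a chain with reward r(x, x') into an equivalent chain whose
    states are the transitions (x, x') and whose reward depends only on
    the current (new) state."""
    n = P.shape[0]
    pairs = [(x, x2) for x in range(n) for x2 in range(n) if P[x, x2] > 0]
    m = len(pairs)
    P_new = np.zeros((m, m))
    r_new = np.zeros(m)
    for i, (x, x2) in enumerate(pairs):
        r_new[i] = r_sas[x, x2]          # reward attached to the new state
        for j, (y, y2) in enumerate(pairs):
            if y == x2:                  # (x, x') may only be followed by (x', .)
                P_new[i, j] = P[y, y2]
    return pairs, P_new, r_new

# Hypothetical 2-state chain with reward r(x, x').
P = np.array([[0.5, 0.5], [1.0, 0.0]])
r_sas = np.array([[1.0, 2.0], [3.0, 0.0]])
pairs, P_new, r_new = state_transition_transform(P, r_sas)
```

Along any sample path, the new chain emits the same reward stream as the original chain, so the total reward distribution (and in particular the long-run average reward) is unchanged.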
In the same MDP setup outlined in Section 3.1, we estimate the total reward CDFs over a long horizon: we fix a long time horizon and apply the state-transition transformation to the MDP with the SAS-function under each stationary policy. Figure 5 shows that the simplification of the SAS-function changes the VaR when the horizon is long, and that the Pareto front with the SAS-function has a wider support than that with the SA-function.
Remark (Pareto front in a long horizon).
In a long-horizon MDP, given an estimated total reward CDF that is strictly increasing, for any percentile there exists a unique threshold at which the CDF attains that percentile; hence both VaR problems can be answered by inverting the estimated CDFs along the Pareto front.
5 Conclusion and Discussions
In this paper, we studied short- and long-horizon MDPs with finite state space under VaR criteria, and the effect of the simplification of the reward function. In short-horizon MDPs, we first illustrated that when the original reward function is an SAS-function, the reward simplification does not affect the optimal value/policy under the expected total reward criterion, but changes the total reward CDF. Second, considering the VaR criteria, we solved VaR Problem 2 with the augmented-state 0-1 MDP method in an expectation form, and we enumerated all policies to obtain the Pareto front of the total reward CDF set. When the horizon is long, we estimated the CDF for every deterministic stationary policy in order to obtain the Pareto front. Since the estimation method applies only to Markov processes derived from MDPs with SA-functions, we proposed a transformation algorithm to make it feasible for MDPs with SAS-functions.
The state-transition transformation enables the original transitions to take on the properties of states. Is there a similar transformation for MDPs, which can convert an MDP with an SAS-function into an MDP with an SA-function and an equivalent total reward distribution? Two components of the MDP deteriorate if the transformation is applied to MDPs directly: the initial state distribution becomes determined by the decision rule at the first epoch, and the salvage reward becomes determined by the decision rule at the final epoch. When we are concerned with long-run performance, so that both effects can be ignored, we can implement a similar MDP transformation directly.
VaR concerns the threshold-percentile pair, and the optimality of one conflicts with the other, since they are in effect non-increasing functions of each other [Filar et al., 1995]. One future direction is to estimate the Pareto front without enumerating all the policies. A special case is when a single policy is optimal at every point of the front; Ohtsubo and Toyonaga [Ohtsubo and Toyonaga, 2002] gave two sufficient conditions for the existence of such an optimal policy in infinite-horizon discounted MDPs. Another idea is to treat the problem as a dual-objective optimization. Zheng [Zheng, 2009] studied the dual-objective MDP concerning variance and CVaR, which may provide some insight.
Under a VaR criterion, the simplification of the reward function affects the VaR. We believe that practical problems involving VaR should be revisited using our proposed transformation approach when the reward function is an SAS-function.
Appendix A Inventory Problem Description
Section 3.2.1 in [Puterman, 1994] describes the model formulation and assumptions for a single-product stochastic inventory control problem. Briefly, at each epoch, define the inventory level before the order, the order quantity, and the demand, the last with a time-homogeneous probability distribution; the next inventory level is then the current level plus the order minus the demand fulfilled.
Define the cost to order a given number of units as a fixed cost for placing the order plus a variable cost proportional to the quantity, and let the revenue be proportional to the units of demand fulfilled. The SAS-function is then the revenue from the demand fulfilled during the period, which equals the inventory after ordering minus the next inventory level, less the order cost. Here we ignore the maintenance fee to simplify the problem.
We set the parameters as follows: the time horizon, the fixed order cost, the variable order cost, the salvage reward, the warehouse capacity, the unit price, the demand probabilities, and a deterministic initial distribution. First we calculate the SAS-function from the revenue and the order cost; second, we calculate the SA-function by Equation (1). Now we have two MDPs with different reward functions.
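Since the numeric parameter values are elided in this copy, the sketch below uses placeholder numbers; only the functional form of the SAS reward follows the text.

```python
# Placeholder parameters (hypothetical; not the paper's actual values).
K = 4        # fixed cost for placing an order
c = 2        # variable cost per unit ordered
price = 8    # revenue per unit of demand fulfilled

def order_cost(a):
    return K + c * a if a > 0 else 0

def r_sas(s, a, s_next):
    # Units sold = stock after ordering minus stock left over.
    units_sold = s + a - s_next
    return price * units_sold - order_cost(a)
```

With the transition law s_next = max(s + a − D, 0) for demand D, averaging r_sas over s_next as in Equation (1) yields the SA-function.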
Appendix B CDF Estimation for Long-Horizon Markov Reward Process
Theorem 17.4.4 in [Meyn and Tweedie, 2009] shows that, when the chain is positive recurrent and the Poisson equation f̂ − P f̂ = r − η admits a suitably integrable solution f̂, the asymptotic variance can be calculated by

σ² = π( f̂² − (P f̂)² ),

where π is the stationary distribution and P the transition matrix.
- [Artzner et al., 1998] Artzner, P., Delbaen, F., Eber, J., and Heath, D. (1998). Coherent measures of risk. Mathematical Finance, 9(3):1–24.
- [Boda and Filar, 2006] Boda, K. and Filar, J. A. (2006). Time Consistent Dynamic Risk Measures. Mathematical Methods of Operations Research, 63(1):169–186.
- [Bonami and Lejeune, 2009] Bonami, P. and Lejeune, M. A. (2009). An Exact Solution Approach for Portfolio Optimization Problems Under Stochastic and Integer Constraints. Operations Research, 57(3):650–670.
- [Bouakiz and Kebir, 1995] Bouakiz, M. and Kebir, Y. (1995). Target-level criterion in Markov decision processes. Journal of Optimization Theory and Applications, 86(1):1–15.
- [Dai et al., 2011] Dai, P., Weld, D. S., and Goldsmith, J. (2011). Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181–209.
- [Defourny et al., 2008] Defourny, B., Ernst, D., and Wehenkel, L. (2008). Risk-Aware Decision Making and Dynamic Programming. In Proceedings of NIPS-08 Workshop on Model Uncertainty and Risk in Reinforcement Learning, pages 1–8.
- [Delage and Mannor, 2010] Delage, E. and Mannor, S. (2010). Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213.
- [Derman, 1970] Derman, C. (1970). Finite State Markovian Decision Processes. Academic Press, Inc.
- [Filar et al., 1995] Filar, J. A., Krass, D., and Ross, K. W. (1995). Percentile Performance Criteria For Limiting Average Markov Decision Processes. IEEE Transactions on Automatic Control, 40(1):2–10.
- [Glynn and Meyn, 1996] Glynn, P. W. and Meyn, S. P. (1996). A Lyapunov bound for solutions of Poisson's equation. The Annals of Probability, pages 916–931.
- [Hou et al., 2014] Hou, P., Yeoh, W., and Varakantham, P. (2014). Revisiting risk-sensitive mdps: New algorithms and results. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), pages 136–144.
- [Kira et al., 2012] Kira, A., Ueno, T., and Fujita, T. (2012). Threshold probability of non-terminal type in finite horizon Markov decision processes. Journal of Mathematical Analysis and Applications, 386(1):461–472.
- [Kolobov et al., 2011] Kolobov, A., Mausam, Weld, D. S., and Geffner, H. (2011). Heuristic search for generalized stochastic shortest path mdps. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), pages 130–137.
- [Kontoyiannis and Meyn, 2003] Kontoyiannis, I. and Meyn, S. P. (2003). Spectral theory and limit theorems for geometrically ergodic Markov processes. Annals of Applied Probability, 13:304–362.
- [Mannor and Tsitsiklis, 2011] Mannor, S. and Tsitsiklis, J. (2011). Mean-Variance Optimization in Markov Decision Processes. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1–22.
- [Meyn and Tweedie, 2009] Meyn, S. P. and Tweedie, R. L. (2009). Markov chains and stochastic stability. Springer Science & Business Media.
- [Ohtsubo and Toyonaga, 2002] Ohtsubo, Y. and Toyonaga, K. (2002). Optimal policy for minimizing risk models in Markov decision processes. Journal of Mathematical Analysis and Applications, 271(1):66–81.
- [Prashanth et al., 2015] Prashanth, L. A., Cheng, J., Fu, M., Marcus, S., and Jun, L. G. (2015). Cumulative Prospect Theory Meets Reinforcement Learning: Estimation and Control. Working Paper, pages 1–27.
- [Puterman, 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
- [Randour et al., 2015] Randour, M., Raskin, J., and Sankur, O. (2015). Percentile Queries in Multi-dimensional Markov Decision Processes. Computer Aided Verification, 9206:123–139.
- [Riedel, 2004] Riedel, F. (2004). Dynamic coherent risk measures. Stochastic Processes and their Applications, 112(2):185–200.
- [Ruszczyński and Shapiro, 2006] Ruszczyński, A. and Shapiro, A. (2006). Optimization of Convex Risk Functions. Mathematics of Operations Research, 31(3):433–452.
- [Sobel, 1994] Sobel, M. J. (1994). Mean-Variance Tradeoffs in an Undiscounted MDP. Operations Research, 42(1):175–183.
- [Steinmetz et al., 2016] Steinmetz, M., Hoffmann, J., and Buffet, O. (2016). Goal probability analysis in mdp probabilistic planning: Exploring and enhancing the state of the art. Journal of Artificial Intelligence Research, 57:229–271.
- [White, D. J., 1988] White, D. J. (1988). Mean, Variance, and Probabilistic Criteria in Finite Markov Decision Processes: A Review. Journal of Optimization Theory and Applications, 56(1):1–29.
- [Wu and Lin, 1999] Wu, C. and Lin, Y. (1999). Minimizing Risk models in Markov decision process with policies depending on target values. Journal of Mathematical Analysis and Applications, 23(1):47–67.
- [Xu and Mannor, 2011] Xu, H. and Mannor, S. (2011). Probabilistic goal Markov decision processes. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2046–2052.
- [Yiu et al., 2004] Yiu, K. F. C., Wang, S. Y., and Mak, K. L. (2004). Optimal portfolios under a value-at-risk constraint. Journal of Economic Dynamics and Control, 28(7):1317–1334.
- [Yu et al., 2015] Yu, P., Yu, J. Y., and Xu, H. (2015). Central-limit approach to risk-aware markov decision processes. arXiv:1512.00583.
- [Yu et al., 1998] Yu, S. X., Lin, Y., and Yan, P. (1998). Optimization models for the first arrival target distribution function in discrete time. Journal of Mathematical Analysis and Applications, 225(1):193–223.
- [Zheng, 2009] Zheng, H. (2009). Efficient frontier of utility and CVaR. Mathematical Methods of Operations Research, 70(1):129–148.