# A Scheme for Dynamic Risk-Sensitive Sequential Decision Making

###### Abstract

We present a scheme for sequential decision making with a risk-sensitive objective and constraints in a dynamic environment. A neural network is trained as an approximator of the mapping from the parameter space to the space of risk and policy under risk-sensitive constraints. For a given risk-sensitive problem, in which the objective and constraints are, or can be estimated by, functions of the mean and variance of return, we generate a synthetic dataset as training data. The parameters defining a targeted process may be dynamic, i.e., they may vary over time, so we sample them within specified intervals to deal with these dynamics. We show that: i) most risk measures can be estimated using the return variance; ii) by virtue of the state-augmentation transformation, practical problems modeled by Markov decision processes with stochastic rewards can be solved in a risk-sensitive scenario; and iii) the proposed scheme is validated by a numerical experiment.

## 1 Introduction

Cities are constantly involved in complex changing processes. City logistics is characterized by multiple stakeholders with different objectives and constraints. Sustainability plays an increasingly significant role in planning and management within organizations and across supply chains [1]. Studies on sustainable city logistics reflect on the economic and environmental problems of supply chain management, but only a few of them take risk into consideration. Since many sustainable city logistics problems, such as planning, routing, and scheduling, can be modeled as sequential decision making (SDM) problems [2], in this paper we propose a scheme composed of reinforcement learning (RL) methods for risk-sensitive SDM with constraints in dynamic environments.

Risk-sensitive objectives have been drawing more attention in many practical problems, especially those in which small-probability events have serious consequences. Since various goals are targeted in practice, one solution is to optimize one of them as the objective and treat the others as constraints. Furthermore, the environment may vary over time, so we sample the variables involved in the decision making within specified intervals to deal with these dynamics. In this paper, we consider SDM problems with three concerns: risk, constraints, and a dynamic environment. To solve such problems, we propose a scheme for generating a synthetic dataset and training an approximator (here we use a neural network as an example). The motivation for the proposed scheme arises from practical problems in which exact analytic risk analysis may be excessively complicated or prohibitively expensive, especially in a dynamic scenario. Though historical data is usually used for neural network training, in many practical situations the recorded decisions are not optimal. Furthermore, in risk-sensitive cases, the criteria on which the decisions were based are not clarified. Therefore, we generate a synthetic dataset for problems with a specified risk-sensitive objective and constraints. For a given parameter set, we construct a Markov decision process (MDP), calculate the return variance for every deterministic policy, evaluate (or estimate) the specified risk measures, and record the optimal policy.

A practical inventory control problem is considered as an example, in which the risk objective and constraints can be evaluated or estimated with the return variance. An inventory control problem is dynamic in the long run. For example, wholesale prices change all the time, as do the uncertainties in market demand and supplier reliability. How to adapt to these dynamics efficiently from a risk perspective is not only an academic concern but also a business one.

## 2 Constrained Markov Decision Processes and Risk Estimations

In this section, we first introduce the notation of constrained MDPs. Second, we present three types of law-invariant risk measures commonly studied in RL. We show that the three risk measures can be evaluated or estimated with the return variance, and that any of them can serve as the objective or a constraint in our study, as can other functions of the mean and variance of return. Third, we review the return variance calculation method for a Markov process with a deterministic state-based reward. Finally, we restate the state-augmentation transformation (SAT) as an MDP homomorphism, which extends the variance calculation method to a Markov process with a stochastic transition-based reward.

### 2.1 Constrained Markov Decision Processes

MDPs provide a framework for modeling sequential decision making processes. In this paper, we focus on infinite-horizon discrete-time MDPs. An MDP with a stochastic reward can be represented by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, P, q, \mu, \gamma)$,

in which $\mathcal{S}$ is a finite state space, and $s_t \in \mathcal{S}$ represents the state at epoch $t$; $\mathcal{A}(s) \subseteq \mathcal{A}$ is the allowable action set for state $s$, $\mathcal{A}$ is a finite action space, and $a_t \in \mathcal{A}(s_t)$ represents the action at epoch $t$; $\mathcal{R}$, a bounded finite subset of $\mathbb{R}$, is the set of possible values of the immediate rewards, and $r_t \in \mathcal{R}$ denotes the immediate reward at epoch $t$; $P$ is the transition probability, with $P(s' \mid s, a)$ denoting the homogeneous transition probability; $q$ is the reward distribution, with $q(r \mid s, a, s')$ the probability that the immediate reward at time $t$ is $r$, given current state $s$, action $a$, and next state $s'$; $\mu$ is the initial state distribution; and $\gamma \in (0, 1)$ is the discount factor. Given such an MDP, one is concerned with the discounted total reward, or the return

$$R = \sum_{t=0}^{\infty} \gamma^{t} r_{t}. \tag{1}$$

In most cases, the expected return $\mathbb{E}[R]$ is considered as the optimality criterion (objective).
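As a concrete toy illustration, the return in Equation (1) can be estimated by truncated Monte Carlo rollouts; the two-state chain and rewards below are invented for illustration and are not a model from this paper.

```python
import random

def rollout_return(P, r, gamma, s0, horizon, rng):
    """One truncated sample of R = sum_t gamma^t r_t for a Markov chain
    with deterministic state-based rewards."""
    s, ret, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        ret += disc * r[s]
        disc *= gamma
        s = rng.choices(range(len(P)), weights=P[s])[0]
    return ret

P = [[0.8, 0.2], [0.3, 0.7]]   # toy transition matrix
r = [1.0, -1.0]                # toy state-based rewards
rng = random.Random(0)
samples = [rollout_return(P, r, 0.9, 0, 200, rng) for _ in range(1000)]
mean_return = sum(samples) / len(samples)
```

Since $|r_t| \le 1$ here, every sample is bounded by $\sum_t \gamma^t = 1/(1-\gamma) = 10$.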

A policy $\pi$ describes how to choose actions sequentially. An MDP with a (randomized) policy induces a Markov reward process. In the infinite-horizon case, a stationary policy space $\Pi$ is considered. Randomized policies are often considered in constrained MDPs [3]. In this study, we focus on the deterministic policy space $\Pi^{D} \subseteq \Pi$, which is easy to use in practice. An optimal policy with respect to a criterion is one whose induced Markov process optimizes that criterion.

In many practical SDM problems, a single objective might not suffice to describe the real case. To deal with various goals, a natural way is to optimize one objective with constraints on the other goals [3]. In particular, since city logistics is characterized by multiple stakeholders with different objectives and constraints, we model such problems as constrained MDPs. A constrained MDP is an MDP with a criterion composed of an objective and constraint(s). In this study, we consider risk-sensitive objectives and constraints that are functions of the mean and variance of the return $R$ in Equation (1). Next, we review the risk measures widely studied in RL.

### 2.2 Risk Measures

In RL, uncertainty is studied from two perspectives. One is the external uncertainty, which refers to the parameter uncertainty or disturbance. When the model is unknown, its parameters are usually estimated first, and then the optimal solution is calculated with a model-based approach. However, the parameter estimation depends on noisy data in practice, and the modeling errors may result in negative consequences. In control theory, this problem is known as robust control. Robust control methods consider the uncertain parameters within some compact sets, and optimize the expected return with the worst-case parameters, in order to achieve good robust performance and stability [4].

The other is the inherent (or internal) uncertainty, which results from the stochastic nature of the process. We claim that most, if not all, inherent risk measures depend on the reward sequence $\{r_t\}_{t \ge 0}$, so an inherent risk measure can be denoted by a functional of that sequence. The inherent risk can be quantified by a dynamic measure or a law-invariant measure. Given a Markov process with a deterministic reward and a deterministic initial state, and a sequence of risk measures $\{\rho_t\}$, a dynamic measure can be denoted in general as a stagewise composition of the $\rho_t$ over the immediate rewards, which is sensitive to the order of the immediate rewards. Dynamic measures are usually assumed to have a set of properties, such as the Markov property, monotonicity, and coherence, which yields a time-consistent risk measure with a nested structure. For further information on dynamic risk measures, see [5].

Given a discount factor $\gamma$, a law-invariant [6] measure in an infinite horizon is a functional of the return. Three types of law-invariant risk have been widely studied in the RL literature.

#### 2.2.1 Utility risk

The original goal of a utility function is to represent subjective preference [7]. One classic example is the “St. Petersburg Paradox,” which refers to a lottery with an infinite expected reward that no one would put up an arbitrarily high stake to play, since the probability of obtaining a high enough reward is too small. Mathematically, a utility function is a mapping from the objective value space of all possible outcomes to a subjective value space. A utility objective is usually of the form $\mathbb{E}[u(R)]$, where $u$ is a strictly increasing function. Denoting the return variance by $\mathbb{V}[R]$, the most commonly used utility risk in RL is the exponential utility [8]

$$U_{\beta}(R) = \frac{1}{\beta} \log \mathbb{E}\left[e^{\beta R}\right],$$

where $\beta \neq 0$ models a constant risk sensitivity, which is risk-averse when $\beta < 0$. This can be seen more clearly with the Taylor expansion of the utility:

$$U_{\beta}(R) \approx \mathbb{E}[R] + \frac{\beta}{2} \mathbb{V}[R].$$
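The quality of this second-order approximation can be checked numerically; the sketch below compares the exponential utility with its mean-variance approximation on made-up Gaussian return samples (for a Gaussian return the approximation is exact up to Monte Carlo error).

```python
import math, random

rng = random.Random(1)
beta = -0.2  # risk-averse sensitivity
# Toy return samples: a Gaussian stand-in for the return distribution.
samples = [rng.gauss(10.0, 2.0) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

# Exponential utility and its second-order (mean-variance) approximation
u_exact = math.log(sum(math.exp(beta * x) for x in samples) / len(samples)) / beta
u_approx = mean + 0.5 * beta * var
```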

#### 2.2.2 Mean-variance risk

Another type of risk measure is the mean-variance risk, also known in finance through modern portfolio theory [9, 10, 11]. Mean-variance analysis aims at the optimal return at a given level of risk, or the optimal risk at a given level of return. In RL, several mean-variance models have been studied. Denoting the return standard deviation by $\sigma[R]$, one model is to maximize

$$\mathbb{E}[R] + \lambda \, \sigma[R],$$

where $\lambda$ is a risk parameter; when $\lambda < 0$, it is a risk-averse objective. This was the first mean-variance model used to explore inventory-management-related problems [12]. Another model maximizes the expected return with a variance constraint, or minimizes the variance with an expected-return constraint [13]. For a review of mean-variance risk, see [14].

#### 2.2.3 Quantile-based risk

The last type of risk measure used in practice refers to quantiles, which requires us to pay attention to discontinuities and intervals of quantile values. A commonly used quantile-based risk measure is value at risk (VaR). VaR originates from finance: for a given portfolio, a loss threshold (target level), and a horizon, VaR concerns the probability that the loss on the portfolio exceeds the threshold over the time horizon. Mathematically, VaR can be defined as finding a policy which maximizes the smallest possible outcome with respect to a specified probability level. In RL, the VaR objective can also be defined as finding a policy which maximizes the probability that the return is larger than or equal to a specified target (threshold) [15, 16]. Denote by $F_{\pi}$ the return CDF under a policy $\pi$. In this paper, we consider two VaR problems [15] in an infinite-horizon MDP.

###### Problem 1

Given a quantile level $\alpha$, find the optimal threshold $\tau^{*}$.

###### Problem 2

Given a threshold $\tau$, find the optimal quantile $\alpha^{*}$.

Both VaR problems relate to $F_{\pi}$, which we here name the VaR function. In a long or infinite horizon case, the return distribution can be estimated with a strictly increasing function [17]. In this case, VaR can be considered a study of the return distribution, since any point along $F_{\pi}$ is (estimated) with $F_{\pi}(\tau)$ or $F_{\pi}^{-1}(\alpha)$. Therefore, both VaR problems refer to $F_{\pi}$. VaR is straightforward but hard to deal with, since it is not a coherent risk measure [18]. In many cases, conditional VaR (also known as expected shortfall) is preferred over VaR since it is coherent [19], i.e., it has some intuitively reasonable properties (convexity, for example). However, when the return can be assumed to be approximately normally distributed, VaR can be simply estimated with $\mathbb{E}[R]$ and $\mathbb{V}[R]$ [20].
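Under the normality assumption, both VaR problems reduce to one-line quantile computations; below is a small sketch with assumed mean and variance values (the function names are ours, not from the paper).

```python
from statistics import NormalDist

def var_threshold(mean, var, alpha):
    """Normal-approximation answer to Problem 1: the threshold tau with
    P(R >= tau) = alpha, i.e. the (1 - alpha) quantile of the return."""
    return NormalDist(mean, var ** 0.5).inv_cdf(1.0 - alpha)

def var_quantile(mean, var, tau):
    """Normal-approximation answer to Problem 2: P(R >= tau)."""
    return 1.0 - NormalDist(mean, var ** 0.5).cdf(tau)

tau = var_threshold(10.0, 4.0, 0.95)   # threshold exceeded with prob. 0.95
alpha = var_quantile(10.0, 4.0, tau)   # recovers 0.95
```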

In this paper, we focus on law-invariant risk measures of $R$. The three law-invariant risk measures above can be either calculated (mean-variance risk), estimated (utility risk), or estimated under an assumption (quantile-based risk) with $\mathbb{E}[R]$ and $\mathbb{V}[R]$. By virtue of the state-augmentation transformation (SAT), $\mathbb{V}[R]$ can be calculated in a Markov process with a stochastic reward, which allows risk evaluation in practical SDM problems. Next, we show how to calculate the return variance in a Markov process with a deterministic reward, and how the SAT extends the method to a Markov process with a stochastic reward.

### 2.3 Variance Formula for Markov Processes

As shown above, the three law-invariant risk measures can be evaluated or estimated with the return variance. To calculate the return variance, Sobel [21] presented a formula for a Markov process with a deterministic reward.

###### Theorem 1

Consider an infinite-horizon Markov process with a finite state space $\mathcal{S}$, a deterministic, bounded, state-based reward function $r$, and a discount factor $\gamma \in (0, 1)$. Denote the transition matrix by $P$, in which $P_{ij}$ is the probability of transitioning from state $i$ to state $j$. Denote the conditional return expectation by $v_i = \mathbb{E}[R \mid s_0 = i]$ for any deterministic initial state $i$, and the conditional expectation vector by $v$. Similarly, denote the conditional return variance by $\psi_i = \mathbb{V}[R \mid s_0 = i]$, and the conditional variance vector by $\psi$. Let $v^{(2)}$ denote the vector whose $i$th component is $v_i^2$. Then

$$v = r + \gamma P v, \qquad \psi = \gamma^{2} P \psi + \gamma^{2} \left( P v^{(2)} - (P v)^{(2)} \right),$$

where $(P v)^{(2)}$ denotes the componentwise square of $P v$.
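Both quantities can be obtained by solving two linear systems; the sketch below (with a made-up two-state chain) implements the mean and variance formulas in matrix form.

```python
import numpy as np

def return_mean_variance(P, r, gamma):
    """Conditional return mean v and variance psi for a Markov chain with
    deterministic state-based rewards (Sobel's formulas)."""
    n = len(r)
    I = np.eye(n)
    v = np.linalg.solve(I - gamma * P, r)        # v = r + gamma P v
    rhs = gamma**2 * (P @ v**2 - (P @ v)**2)
    psi = np.linalg.solve(I - gamma**2 * P, rhs) # psi = gamma^2 P psi + rhs
    return v, psi

P = np.array([[0.9, 0.1], [0.5, 0.5]])           # toy transition matrix
v, psi = return_mean_variance(P, np.array([1.0, -1.0]), 0.95)
```

A quick sanity check: with identical rewards in every state, the return is deterministic, so the variance vector must be zero.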

Notice that the variance formula applies to Markov processes with deterministic reward functions only; the question is how to apply the method to practical problems with stochastic rewards. In the next section, we review the SAT [22] as an MDP homomorphism, which extends the variance formula to a Markov process with a stochastic reward.

### 2.4 SAT Homomorphism

In many practical risk-sensitive problems, the rewards of the Markov processes are stochastic, but many methods require the reward in a deterministic form [23, 24, 25, 26, 27, 28, 29]. To enable those methods in such cases, we may use the SAT to transform the Markov process while preserving the reward sequence $\{r_t\}$. In this paper, the example is that the variance formula holds for Markov processes with deterministic reward functions only. We restate the SAT as an MDP homomorphism. Compared with the original SAT theorem [22], the homomorphism version of the SAT is on a more abstract level. An MDP homomorphism is a formalism that captures an intuitive notion of specific equivalence between MDPs [30]. In order to convert an MDP with a stochastic reward into one with a deterministic state-based reward while preserving $\{r_t\}$, we consider each “situation” that determines an immediate reward as an augmented state. We can then attach each possible reward value to an augmented state in the transformed MDP. Formally, we define the SAT homomorphism as follows.

###### Definition 1 (SAT, a homomorphism version)

The SAT for MDPs is a homomorphism $h$ from an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, P, q, \mu, \gamma)$ to an MDP $\hat{\mathcal{M}} = (\hat{\mathcal{S}}, \mathcal{A}, \mathcal{R}, \hat{P}, \hat{r}, \hat{\mu}, \gamma)$ with a deterministic state-based reward $\hat{r}$. Each augmented state $\hat{s} = (s, a, s', r) \in \hat{\mathcal{S}} \subseteq \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathcal{R}$ encodes a “situation” (current state, action, next state) together with the reward value it yields, so that $\hat{r}(\hat{s}) = r$; the transition probability $\hat{P}$, the allowable action sets, and the initial distribution $\hat{\mu}$ are induced from $P$, $q$, and $\mu$ accordingly.

We call $\hat{\mathcal{M}}$ the homomorphic image of $\mathcal{M}$ under $h$. For any policy in $\mathcal{M}$, there exists a policy in $\hat{\mathcal{M}}$ such that the two induced processes share the same distribution over reward sequences $\{r_t\}$. We define the mapping between the two policy spaces as a policy lift.

###### Definition 2 (Policy lift)

Let $\hat{\mathcal{M}}$ be a homomorphic image of $\mathcal{M}$ under $h$. Let $\pi$ be a stochastic policy in $\mathcal{M}$. Then $\pi$ lifted to $\hat{\mathcal{M}}$ is the policy $\hat{\pi}$ whose action distribution in each augmented state agrees with that of $\pi$ in the corresponding original state.

Given an MDP with a policy, the randomness of the induced Markov reward process can be studied in its underlying probability space.

###### Definition 3 (Underlying probability space)

Let $(\Omega, \mathcal{F}, \Pr)$ be a probability space, and let $(\mathcal{S}, \Sigma)$ be a measurable space. An induced Markov reward process can be represented by an $\mathcal{S}$-valued stochastic process $X = \{X_t : t \geq 0\}$ on $(\Omega, \mathcal{F}, \Pr)$, a family of random variables indexed by time. $(\Omega, \mathcal{F}, \Pr)$ is called the underlying probability space of the process $X$. For each $\omega \in \Omega$, the mapping $t \mapsto X_t(\omega)$ is called the trajectory of the process with respect to $\omega$. The process is progressively measurable with respect to its natural filtration.

A homomorphism version of the SAT theorem is as follows, which claims that the probability measure on trajectories is preserved under $h$. Therefore, as a function of the sample path, the probability measure on the reward sequence $\{r_t\}$ is preserved as well.

###### Theorem 2 (Probability measure preservation)

Let $\hat{\mathcal{M}}$ be an image of $\mathcal{M}$ under homomorphism $h$, and let $\hat{\pi}$ be the stochastic policy lifted from $\pi$. For the two processes induced by $\mathcal{M}$ with $\pi$ and by $\hat{\mathcal{M}}$ with $\hat{\pi}$, there exists a bijection $g$ between their trajectory sets such that, for any measurable set of trajectories $B$ in the underlying probability space of the first process, its image $g(B)$ in the sample path probability space of the second process satisfies $\Pr(B) = \hat{\Pr}(g(B))$.
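To make the transformation concrete, the sketch below takes a two-state chain with stochastic transition-based rewards (an invented example, under a fixed policy), augments each transition-reward "situation" (s, s', r) into a state carrying a deterministic reward, and then applies Sobel's formulas; the return mean computed on the augmented chain matches the mean computed directly on the original chain.

```python
import numpy as np

gamma = 0.9
# Two-state chain under a fixed policy; reward depends on the transition
# and is stochastic: q[(s, s_next)] = list of (reward, probability).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
q = {(0, 0): [(1.0, 0.5), (3.0, 0.5)],
     (0, 1): [(0.0, 1.0)],
     (1, 0): [(2.0, 0.2), (-1.0, 0.8)],
     (1, 1): [(1.0, 1.0)]}
mu = np.array([1.0, 0.0])

# SAT: each augmented state (s, s', r) fixes the reward deterministically.
aug = [(s, s2, r) for s in range(2) for s2 in range(2) for r, p in q[(s, s2)]]
idx = {x: i for i, x in enumerate(aug)}
n = len(aug)
P_hat, r_hat, mu_hat = np.zeros((n, n)), np.zeros(n), np.zeros(n)
for (s, s2, r), i in idx.items():
    r_hat[i] = r
    for s3 in range(2):                       # next situation starts from s2
        for r2, p2 in q[(s2, s3)]:
            P_hat[i, idx[(s2, s3, r2)]] = P[s2, s3] * p2
    pr = next(p for rr, p in q[(s, s2)] if rr == r)
    mu_hat[i] = mu[s] * P[s, s2] * pr         # first transition and reward

# Sobel's formulas on the augmented (deterministic-reward) chain
I = np.eye(n)
v = np.linalg.solve(I - gamma * P_hat, r_hat)
rhs = gamma**2 * (P_hat @ v**2 - (P_hat @ v)**2)
psi = np.linalg.solve(I - gamma**2 * P_hat, rhs)

# Cross-check the mean against the original chain with expected rewards
r_bar = np.array([sum(P[s, s2] * sum(r * p for r, p in q[(s, s2)])
                      for s2 in range(2)) for s in range(2)])
v_orig = np.linalg.solve(np.eye(2) - gamma * P, r_bar)
mean_aug, mean_orig = mu_hat @ v, mu @ v_orig
```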

## 3 Sequential Decision Making Scheme and Approximator

In this section, we propose a scheme for solving risk-sensitive SDM problems with constraints in a dynamic environment, and we consider a neural network (NN) as an example of the approximator for the scheme.

### 3.1 Sequential Decision Making Scheme

A constrained, dynamic, risk-sensitive SDM problem can be considered as a process (environment) with a criterion (an objective with constraints). For such an SDM problem, we assume that its process is defined by a vector of environmental features $x$, denoted by $x_t$ at epoch $t$. We allow the components of $x$ to vary within some prespecified intervals. Similarly, a targeted risk measure and risk constraints might be defined by a vector of parameters $\theta$, denoted by $\theta_t$ at epoch $t$. To allow decision making under dynamic risk concerns, the components of $\theta$ lie within some prespecified intervals as well. For any given $x$ and $\theta$, denote the optimal risk by $\rho^{*}(x, \theta)$ and the optimal policy by $\pi^{*}(x, \theta)$. In this case, an imagined decision maker can be mathematically denoted by the mapping $(x, \theta) \mapsto (\rho^{*}, \pi^{*})$, i.e., a theoretically perfect decision maker considers all the environmental parameters, the objective, and possible constraints, and chooses the optimal policy with respect to some targeted risk measure. The goal is to train an NN to approximate this mapping.

Usually, historical data is used to train an approximator. However, at least two problems should be considered in practice. The first is information incompleteness: most historical data contains no information on the risk parameters or the optimal risk, and in many cases even the information on the environmental features is incomplete. The second is optimality: in practice, decision makers might prefer an easy-to-use policy to an optimal one, which is hard to determine since practical problems may differ from the theoretical model in diverse and subtle ways. Therefore, we propose to evaluate or estimate risk measures with RL methods and to train the NN with a synthetic dataset.

We present a scheme in Figure 1 to solve a constrained risk-sensitive SDM problem in a dynamic environment. In the feature representation module, we define the SDM process with the feature vector $x$. In the SDM modeling module, a mathematical model is chosen to represent the process and offer a platform for risk evaluation; here we use an MDP as the model. In the risk evaluation module, we calculate the optimal policy for a specified risk measure with risk constraints. For a given MDP with a criterion (an objective with constraint(s), which can be risk-sensitive), firstly, we enumerate all deterministic policies, each of which generates a Markov process. Secondly, we implement the SAT to transform each Markov process into one with a deterministic reward (see Section 2.4), in order to calculate the return variance. Thirdly, we estimate the specified risk measure(s) from the return variance (see Section 2.3); here we consider VaR as an example (see Section 2.2). In the label generation module, the optimal VaR and policy are recorded and attached to the feature vector as a label. Finally, a functional approximator, such as an NN or a linear combination of polynomial or Fourier basis functions, is trained with the labeled feature dataset. In this paper, we consider an inventory control problem as an example; the feature set is fully described in Section 4.

Briefly, the upper level of the scheme is for synthetic data generation (modeling and risk evaluation), and the lower level is for approximator training for dynamic SDM problems. After each decision-making epoch, some parameters are updated from the outcome (such as the market demand distribution) and the involved external variables (such as the wholesale price). Provided the updated parameters remain within the predefined intervals, the trained NN is able to output the estimated optimal policy and risk at the next decision-making epoch. The algorithm for synthetic dataset generation (the upper level of the scheme) proceeds by repeatedly sampling parameters within their intervals, solving each sampled instance, and recording the optimal risk and policy as labels.
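A minimal sketch of this upper level follows; `sample_params` and `solve_instance` are hypothetical stand-ins for the interval sampling, MDP construction, policy enumeration, and risk evaluation modules described above, with a toy two-policy instance.

```python
import random

def generate_dataset(n, sample_params, solve_instance):
    """Upper level of the scheme: sample (environment, risk) parameters
    within their intervals and label each instance with the optimal
    risk value and policy."""
    data = []
    for _ in range(n):
        x, theta = sample_params()
        rho_star, pi_star = solve_instance(x, theta)
        data.append(((x, theta), (rho_star, pi_star)))
    return data

def sample_params():
    x = random.uniform(1.0, 2.0)       # environmental feature interval
    theta = random.uniform(0.1, 0.9)   # risk-sensitivity interval
    return x, theta

def solve_instance(x, theta):
    # Toy stand-in: two fixed "policies" with (mean, variance) depending
    # on a price-like parameter x; pick the best mean-variance score.
    policies = {"order_low": (1.0 * x, 0.2), "order_high": (1.5 * x, 1.0)}
    scores = {p: m - theta * var for p, (m, var) in policies.items()}
    best = max(scores, key=scores.get)
    return scores[best], best

dataset = generate_dataset(100, sample_params, solve_instance)
```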

### 3.2 Approximator

A functional approximator needs to be trained with a dataset to approximate the mapping from the parameter space to the risk and policy space. In this study, we use a neural network (NN), which is a universal function approximator [31]. Other approximators, such as a linear combination of polynomial or Fourier basis functions, are not considered here. The definition of an NN is as follows.

###### Definition 4 (Feed forward neural network)

Let $L$ be the number of layers, and let $n_0, n_1, \ldots, n_L$ be the numbers of nodes in the layers. Denote the activation function by $\sigma$. For any $\ell \in \{1, \ldots, L\}$, denote the affine function by $T_{\ell}(x) = W_{\ell} x + b_{\ell}$ for some $W_{\ell} \in \mathbb{R}^{n_{\ell} \times n_{\ell-1}}$ and $b_{\ell} \in \mathbb{R}^{n_{\ell}}$; the $(i, j)$ entry of $W_{\ell}$ denotes the weight of the edge from the $j$th node in the $(\ell-1)$th layer to the $i$th node in the $\ell$th layer. With $\sigma$ applied componentwise between consecutive affine maps, a function defined as

$$f = T_{L} \circ \sigma \circ T_{L-1} \circ \cdots \circ \sigma \circ T_{1}$$

is called a feed forward neural network. Since we use an NN as an estimator of the decision maker, we consider the set of neural networks mapping from the parameter space $(x, \theta)$ to the risk and policy space.
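Definition 4 translates directly into a forward pass; in the sketch below the layer sizes, weights, and activation are arbitrary placeholders.

```python
import numpy as np

def feed_forward(x, weights, biases, sigma=np.tanh):
    """f = T_L o sigma o T_{L-1} o ... o sigma o T_1, with affine maps
    T_l(x) = W_l x + b_l; the activation is applied between layers but
    not after the last one."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = sigma(W @ x + b)
    W, b = weights[-1], biases[-1]
    return W @ x + b

rng = np.random.default_rng(0)
sizes = [4, 8, 3]  # n_0, n_1, n_2 (placeholder widths)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
y = feed_forward(rng.normal(size=4), weights, biases)
```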

In this study, for any given parameter vector, we construct an MDP model and, by virtue of the SAT [22], preserve the reward sequence so that the return variance of a Markov process with a stochastic reward can be calculated. As claimed in Section 2.2, most inherent risk measures can be evaluated or estimated from the return mean and variance, with or without an additional assumption. We calculate the optimal risk and policy and attach them to the parameter vectors as labels. Finally, we train an NN with the synthetic dataset. In the next section, we present an inventory management problem as an example to show the implementation details.

## 4 Numerical Experiment

Figure 1 illustrates the scheme with NN and RL methods for a dynamic risk-sensitive SDM problem. The scheme includes synthetic dataset generation (upper layer) and approximator training and prediction (lower layer). To generate the dataset for training and testing, we define and randomly generate a domain set. For each domain instance, we estimate the optimal VaR and the corresponding policy as follows. Firstly, construct an inventory MDP under some predefined assumptions. Secondly, enumerate all deterministic policies to obtain a set of Markov reward processes, for which the VaR function of the MDP with a stochastic reward is to be calculated. Thirdly, acquire the set of transformed Markov processes by Theorem 2, with the rewards deterministic and state-based. Fourthly, estimate the mean and variance vectors by Theorem 1. Assuming the return is approximately normally distributed, we obtain the estimated return distributions for all policies and calculate the VaR function for the inventory MDP. Finally, for the risk sensitivity parameters in the domain instance, calculate the risk-sensitive objective and constraints, and record the optimal VaR and policy. With the training dataset, we tune, train, and validate an NN, and use it to predict the optimal risk and policy for any feature vector within the prespecified intervals.

### 4.1 An Inventory Management Example

In the management of sustainable city logistics, both costs and risks need to be evaluated simultaneously, and they often conflict [32]. Moreover, considering the changing environment, the solution should be sensitive to the dynamics of the model parameters. As far as we know, most decision-making methods lack a holistic, integrated vision and do not attempt to safeguard sustainable city logistics security at different levels, thus causing unnecessarily high costs and disturbances. To shift from risk transfer and toleration towards risk-sensitive control, the scheme proposed in Figure 1 can be regarded as a solution in this respect. As a proof of principle for the applicability of the proposed scheme, we apply it with an NN to a practical inventory control process as an example, in which three factors are modeled: multiple sourcing, supplier reliability, and backlog. It is worth noting that an MDP is a discrete-time process, so the output optimal policy is a periodic-review strategy, which has its limitations compared with its aperiodic-review counterparts. Furthermore, some assumptions are made to keep the model small. Hence, this inventory control SDM model is theoretical, and in many complicated cases a more accurate estimation can be achieved through simulation.

#### 4.1.1 Multiple sourcing

Multiple sourcing (multisourcing) is a strategy that blends services from an optimal set of internal and external suppliers in the pursuit of business goals [33]. We consider two suppliers, $S_1$ and $S_2$, and a retailer. The scenario involves a multinational corporation which functions in the supply chain as a retailer. For a targeted product, this corporation has two suppliers (in the US and in India, for example): the supplier in the US ($S_1$) is close and reliable, while the other supplier ($S_2$) is far and unreliable but offers a lower wholesale price.

#### 4.1.2 Supplier reliability

Supplier (sourcing) reliability can be expressed in terms of quantity, quality, or timing of orders, and is affected by multiple factors, such as equipment breakdowns, material shortages, warehouse capacity constraints, price inflation, strikes, embargoes, and political crises [34]. In the framework of an MDP, one way to model these uncertainties is to take the status of a supplier as a Boolean variable, which represents whether the supplier is “available” or “unavailable”, and can be modeled as a two-state Markov process [35]. In order to simplify the state representation, we assume that the unreliable supplier is available with a probability $p_a$, and the retailer is compensated with a penalty payment $c_p$ if the supplier holding an order is unavailable at the current epoch. The probability $p_a$ can be estimated using historical data.

Transportation can be another source of unreliability; it is emphasized by the long delivery time that mainly results from a long physical distance, and this uncertainty can also be a critical factor at times. In our model, we assume $S_1$ is reliable and its transportation lead time can be ignored (for example, a day when one epoch is a month). We assume $S_2$ offers a lower price but is unreliable; since it is far from the retailer, its lead time is one period. To simplify the model, we fold the transportation factor into supplier reliability.

#### 4.1.3 Backlog and lost sale

For each unsatisfied unit of demand, a backlog or a lost sale occurs; suppose these occur with probabilities $p_b$ and $1 - p_b$, respectively. The probability $p_b$ can be estimated using historical data. We assume that when a backlog happens, the product is sold at a lower price (due, for example, to the cost of fast delivery), and a lost sale induces a cost $c_l$. Rewards for both cases are calculated at the current epoch, in order to keep the state representation small. The backlog and lost sale instances are associated with market demand. To keep the stochastic reward function simple, finitely many possible demands are considered (i.e., a finite support for the reward function). For example, when the warehouse capacity is four units and the maximum demand is six units, the possible maximum amount of backlogs or lost sales is two units. In the MDP model, the reward and cost from these two units are received, discounted, at the current epoch.

To simplify the model, other factors, such as fast delivery, order replacement and foreign currency exchange, have not been considered. The inventory control process is illustrated in Figure 2, in which the solid arrows represent the product flow, and the dashed arrows represent the order flow. Next, we construct an MDP for this inventory control problem.

### 4.2 Inventory MDP

Here we use an infinite-horizon MDP with finite state and action spaces to model the inventory control process. The related parameters are given in Table 1. For a set of specified parameters, we construct the MDP as follows. For time $t$, denote the inventory level by $I_t$ and the number of items in transit by $T_t$. Given the warehouse capacity $M$, the state is $s_t = (I_t, T_t)$. The action is $a_t = (a_{1,t}, a_{2,t})$, where $a_{1,t}$ and $a_{2,t}$ are the order quantities placed with the two suppliers. Assume that the maximum demand at the retailer can be double the warehouse capacity, i.e., the demand lies in $\{0, 1, \ldots, 2M\}$, with probabilities given by the demand distribution $q$. Given the supplier availability probability $p_a$ and the backlog probability $p_b$, the reward and transition probability can be formulated. For example, at time $t$, given the state, the action, the demand, $S_2$ being available, and the next state, the reward is the retail revenue minus the wholesale, fixed ordering, and holding costs, and the transition probability follows from the demand distribution and the supplier availability. As another example, in which $S_2$ is unavailable and one unit is backlogged and one unit is lost, the reward additionally includes the unavailability penalty, the backlog discount, and the lost sale cost, and the transition probability is computed analogously.
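A deliberately simplified, single-supplier sketch of one inventory transition follows; the paper's MDP additionally models dual sourcing, supplier availability, backlogs, and penalties, which are omitted here, and all parameter values are placeholders.

```python
import random

def step(inv, order, capacity, price, wholesale, holding, demand_pmf, rng):
    """One simplified inventory transition: sell what demand allows,
    receive the order, and collect revenue minus wholesale and holding
    costs."""
    demand = rng.choices(range(len(demand_pmf)), weights=demand_pmf)[0]
    sold = min(inv, demand)
    inv_next = min(capacity, inv - sold + order)   # warehouse cap binds
    reward = price * sold - wholesale * order - holding * inv_next
    return inv_next, reward

rng = random.Random(7)
inv, total = 2, 0.0
for _ in range(5):
    inv, r = step(inv, order=2, capacity=4, price=5.0, wholesale=2.0,
                  holding=0.1, demand_pmf=[0.2, 0.5, 0.3], rng=rng)
    total += r
```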

Next, we consider a VaR objective with a risk-sensitive constraint, and generate the synthetic training data for NN.

| Parameter | Symbol |
| --- | --- |
| Warehouse capacity | $M$ |
| Discount factor | $\gamma$ |
| Initial state distribution | $\mu$ |
| Retail price | $p$ |
| Penalty on supplier’s unavailability | $c_p$ |
| Wholesale prices from $S_1$ and $S_2$ | $w_1$, $w_2$ |
| Backlog cost | $c_b$ |
| Lost sale cost (possibly a dependent variable) | $c_l$ |
| Fixed ordering cost | $c_f$ |
| Holding cost | $c_h$ |
| Demand distribution function (PMF) | $q$ |
| Supplier availability probability | $p_a$ |
| Backlog probability | $p_b$ |
| Risk sensitivity parameter (for VaR) | $\alpha$ |

### 4.3 Dataset Generation

In this paper, we consider the VaR Problem 1 as the objective, with a constraint of the form $\mathbb{E}[R] / \mathbb{V}[R] \geq \omega$ as an example, in which $\omega$ is a prespecified constant. The intuitive meaning of this constraint is that the earning per unit of risk (variance) should be larger than the threshold $\omega$. We artificially generate a dataset for the inventory control process with the objective of optimizing VaR at a confidence level $\alpha$ under this constraint.

There are multiple ways to define a mapping from the parameter space to the risk and policy space. Here we define it as follows. Define the domain set $\mathcal{X}$, with each instance referring to a vector of inventory control parameters (retail price, fixed ordering cost, etc.). Define the label set $\mathcal{Y}$, with each instance referring to a vector of labels (optimal VaR, optimal policy, etc.). A dataset is a finite set of pairs in $\mathcal{X} \times \mathcal{Y}$, i.e., a set of labeled domain instances.

It is worth noting that, to construct an NN, one needs to predefine certain parameters, including but not limited to the number of layers, the numbers of nodes, and the activation function (see Section 3.2 for details). These predefined parameters, which determine the network structure and how the NN is trained, are called hyperparameters. A fine-tuned NN with a large enough dataset should be able to make a satisfactory prediction. For our inventory control problem, all related parameters are set and sampled as follows.

Hyperparameters: A domain instance includes all variable values except for the warehouse capacity $M$, the discount factor $\gamma$, and the initial inventory distribution $\mu$, which are considered hyperparameters and fixed. The initial state is set so that, at the beginning, the inventory level and the amount of items in transit are both zero. These parameters are fixed only to simplify the problem.

Revenue-related parameters: For generating the dataset, the retail price $p$ is taken as a random variable with a uniform distribution over a prespecified support. Similarly, the penalty on supplier unavailability $c_p$, the wholesale prices $w_1$ and $w_2$ from $S_1$ and $S_2$, the backlog cost $c_b$, the fixed ordering cost $c_f$ (the same for both suppliers), and the holding cost $c_h$ are each sampled uniformly within their own intervals. For brevity, we set the lost sale cost $c_l$ as a dependent variable (see Table 1).

For probabilistic parameters, we sample the supplier availability probability $p_a$, the backlog probability $p_b$, and the risk sensitivity parameter $\alpha$ uniformly within their intervals. For the demand distribution vector $q$, we sample it uniformly from the probability simplex. After defining the domain features, the labels are calculated, which include the optimal risk and policy.
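Uniform sampling from the probability simplex can be done by normalizing standard-exponential draws (equivalently, a Dirichlet(1, …, 1) sample); the support size below assumes $M = 4$, so demands range over $\{0, \ldots, 8\}$.

```python
import random

def sample_simplex(k, rng):
    """Uniform sample from the (k-1)-simplex, i.e. Dirichlet(1,...,1),
    via normalized standard-exponential draws."""
    e = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(e)
    return [x / s for x in e]

rng = random.Random(42)
q = sample_simplex(9, rng)   # demand PMF over {0, ..., 2M} with M = 4
```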

We denote a policy list (vector) by $(\pi_1, \ldots, \pi_K)$, in which each item represents an allowable deterministic policy. An NN will be trained as an approximator of the mapping whose inputs represent the inventory parameters and the risk level, and whose outputs represent the optimal VaR and the corresponding policy (denoted by a Boolean vector or an index); i.e., the domain set contains the parameter vectors, and the label set contains the optimal risks and policies. Next, we train an NN with the synthetic dataset and show the validity of the proposed scheme.

### 4.4 Numerical Result

Since VaR is numeric and the (deterministic) inventory policy is categorical, we treat both as numeric outputs. Considering the setting given in Section 4.3, we use Keras to construct, train, and validate an NN for this small inventory problem. First, we try a 3-layer NN with a chosen number of nodes per layer. We use the ReLU function and the linear function as the activation functions for the hidden and output layers, respectively, the mean squared error as the loss function, and Adam as the optimizer. With all hyperparameters determined, the network is trained on a training dataset. We train the NN 50 times to obtain the means and variances of the loss values at different epochs, with a batch size of 50 for mini-batch gradient descent.
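The forward computation of the architecture just described (ReLU hidden layers, a linear output layer, MSE loss) can be sketched in a few lines of numpy; the layer widths and batch shapes below are illustrative assumptions, not the values used in the experiment:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, biases):
    """MLP forward pass: ReLU on the hidden layers, linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return h @ weights[-1] + biases[-1]

def mse(y_pred, y_true):
    """Mean squared error, the regression loss used for training."""
    return float(np.mean((y_pred - y_true) ** 2))

# Illustrative widths: 10 input features -> two hidden layers -> 5 outputs.
sizes = [10, 32, 32, 5]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(50, 10))   # one mini-batch of 50 samples
y = rng.normal(size=(50, 5))    # matching synthetic labels
loss = mse(forward(x, weights, biases), y)
```
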

Since this is a regression problem, we measure the loss with regression metrics such as the mean absolute error (MAE) or the mean squared error (MSE). Figure 3 shows that, with the given training data size, the result converges within 10 epochs, with similar hit rates in both the training and validation phases. The error regions represent the standard deviations of the means along the epoch axis.
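Both regression metrics can be computed directly; a minimal example on hand-picked values:

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def mse(y_pred, y_true):
    """Mean squared error."""
    return float(np.mean((y_pred - y_true) ** 2))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
# errors: 0.5, 0.0, 1.0  ->  MAE = 1.5 / 3 = 0.5, MSE = 1.25 / 3
print(mae(y_pred, y_true))  # 0.5
print(mse(y_pred, y_true))  # ~0.4167
```
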

## 5 Related Work

The SDM problems considering risk, dynamic environment, and constraint are usually studied separately. Besides the works reviewed in Section 2.2, Shen [36] generalized risk measures to valuation functions. The author applied a set of valuation functions, derived some model-free risk-sensitive reinforcement learning algorithms, and presented a risk control example in simulated algorithmic trading of stocks. For SDMs in dynamic environments, Hadoux [37] proposed a new model named Hidden Semi-Markov-Mode Markov Decision Process (HS3MDP), which represented non-stationary problems whose dynamics evolved among a finite set of contexts. The author adapted the Partially Observable Monte-Carlo Planning (POMCP) algorithm to HS3MDPs in order to efficiently solve those problems. The POMCP algorithm used a black-box environment simulator and a particle filter to approximate a belief state. The simulator relaxed the model-based requirement, and each filter particle represented a state of the POMDP being solved. For different types of dynamic environment, the author compared a regret-based method with its Markov counterpart [38]. In the regret-based method, the agent was involved in a two-player repeated game, where two agents (the player and the opponent, which can be the environment) chose an action to play, received feedback, and repeated the game.

For SDMs with constraint(s), Chow [28] investigated the variability of stochastic rewards and the robustness to modeling errors. The author analyzed a unifying planning framework with coherent risk measures (such as CVaR), which was robust to inherent uncertainties and modeling errors, and output a time-consistent policy. A scalable approximate value-iteration algorithm on an augmented state space was developed for large-scale problems with a data-driven setup. The author proposed novel policy gradient and actor-critic algorithms for CVaR-constrained and chance-constrained optimization in MDPs. Furthermore, the author proposed a framework for risk-averse model predictive control, where the risk was time-consistent and Markovian. In the framework of CMDP [3], Chow et al. [39] derived two LP-based algorithms—safe policy iteration and safe value iteration—for problems with constraints on expected cumulative costs. The algorithms hinged on a novel Lyapunov method, which constructed Lyapunov functions to provide an effective way to guarantee the global safety of a policy during training via a set of local, linear constraints. For unknown environment models and large state/action spaces, the authors proposed two scalable safe RL algorithms: safe DQN, an off-policy fitted Q-iteration method, and safe DPI, an approximate policy iteration method. Another way to consider risk in an SDM problem is to avoid some dangerous system states. Geibel and Wysotzki [40] defined the risk with respect to a policy as the probability of entering such states, and set the constraint as this probability being smaller than a predefined threshold. The authors presented a model-free RL algorithm with a deterministic policy space. The algorithm was based on weighting the original value function and the risk. The weight parameter was adapted to find a feasible solution for the constrained problem, and the probability defining the risk was expressed as the expectation of a cumulative return.

## 6 Conclusions and Future Work

We propose a scheme using a function approximator and RL methods to solve SDM problems with risk-sensitive constraints in a dynamic scenario. We consider risk measures as functions of the mean and variance of the return in an induced Markov process. As shown in Section 2.2, most, if not all, law-invariant risk measures can be evaluated or estimated with the return variance. Considering that reward functions in practical problems are often stochastic, we implement SAT to enable the variance formula and preserve the reward sequence. In an inventory control problem with practical factors, an NN is trained and validated with a synthetic training dataset in the numerical experiment.

The current work can be extended in two directions. The first has to do with the “no free lunch” theorem. In our inventory control model, the more factors we consider, the larger the policy space becomes. When the warehouse capacity is 3, there are 8748 deterministic policies; when it is 4 or 5, the size grows even larger. How to deal with a policy space of such an astronomical magnitude in a risk-sensitive setting remains an open problem. The second is how to handle two different output data types in an NN efficiently. In our setting, the outputs include the optimal risk value (numeric) and the policy (categorical). A crucial task is how to choose the activation functions, normalization, and loss function for network training.

## References

- [1] K. Grzybowska and G. Kovács, “Sustainable supply chain - supporting tools,” in 2014 Federated Conference on Computer Science and Information Systems, pp. 1321–1329, 2014.
- [2] A. Franceschetti, Sustainable city logistics: fleet planning, routing and scheduling problems. PhD thesis, Technische Universiteit Eindhoven, 2015.
- [3] E. Altman, Constrained Markov Decision Processes. CRC Press, 1999.
- [4] A. Nilim and L. E. Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Operations Research, vol. 53, no. 5, pp. 780–798, 2005.
- [5] A. Ruszczyński, “Risk-averse dynamic programming for Markov decision processes,” Mathematical Programming, vol. 125, no. 2, pp. 235–261, 2010.
- [6] S. Kusuoka, “On law invariant coherent risk measures,” in Advances in Mathematical Economics, pp. 83–95, Springer, 2001.
- [7] R. A. Howard and J. E. Matheson, “Risk-sensitive Markov decision processes,” Management science, vol. 18, no. 7, pp. 356–369, 1972.
- [8] K.-J. Chung and M. J. Sobel, “Discounted MDP's: Distribution functions and exponential utility maximization,” SIAM Journal on Control and Optimization, vol. 25, no. 1, pp. 49–62, 1987.
- [9] D. J. White, “Mean, variance, and probabilistic criteria in finite Markov decision processes: A review,” Journal of Optimization Theory and Applications, vol. 56, no. 1, pp. 1–29, 1988.
- [10] M. J. Sobel, “Mean-variance tradeoffs in an undiscounted MDP,” Operations Research, vol. 42, no. 1, pp. 175–183, 1994.
- [11] S. Mannor and J. Tsitsiklis, “Mean-variance optimization in Markov decision processes,” in Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1–22, 2011.
- [12] H.-S. Lau, “The newsboy problem under alternative optimization objectives,” Journal of the Operational Research Society, vol. 31, no. 6, pp. 525–535, 1980.
- [13] T.-M. Choi, D. Li, and H. Yan, “Mean–variance analysis for the newsvendor problem,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 38, no. 5, pp. 1169–1180, 2008.
- [14] C.-H. Chiu and T.-M. Choi, “Supply chain risk analysis with mean-variance models: A technical review,” Annals of Operations Research, vol. 240, no. 2, pp. 489–507, 2016.
- [15] J. A. Filar, D. Krass, and K. W. Ross, “Percentile performance criteria for limiting average Markov decision processes,” IEEE Transactions on Automatic Control, vol. 40, no. 1, pp. 2–10, 1995.
- [16] C. Wu and Y. Lin, “Minimizing risk models in Markov decision process with policies depending on target values,” Journal of Mathematical Analysis and Applications, vol. 23, no. 1, pp. 47–67, 1999.
- [17] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability. Springer Science & Business Media, 2009.
- [18] F. Riedel, “Dynamic coherent risk measures,” Stochastic Processes and their Applications, vol. 112, no. 2, pp. 185–200, 2004.
- [19] P. Artzner, F. Delbaen, J. Eber, and D. Heath, “Coherent measures of risk,” Mathematical Finance, vol. 9, no. 3, pp. 1–24, 1998.
- [20] S. Ma and J. Y. Yu, “Transition-based versus state-based reward functions for MDPs with Value-at-Risk,” in Proceedings of the 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 974–981, 2017.
- [21] M. J. Sobel, “The variance of discounted Markov decision processes,” Journal of Applied Probability, vol. 19, no. 4, pp. 794–802, 1982.
- [22] S. Ma and J. Y. Yu, “State-augmentation transformations for risk-sensitive reinforcement learning,” arXiv:1804.05950v2, 2018.
- [23] V. S. Borkar, “Q-learning for risk-sensitive control,” Mathematics of Operations Research, vol. 27, no. 2, pp. 294–311, 2002.
- [24] Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer, “Risk-sensitive reinforcement learning,” Neural Computation, vol. 26, no. 7, pp. 1298–1328, 2014.
- [25] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
- [26] H. Gilbert and P. Weng, “Quantile reinforcement learning,” arXiv:1611.00862, 2016.
- [27] W. Huang and W. B. Haskell, “Risk-aware Q-learning for Markov decision processes,” in Proceedings of the 56th IEEE Conference on Decision and Control (CDC), pp. 4928–4933, 2017.
- [28] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk-constrained reinforcement learning with percentile risk criteria,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6070–6120, 2017.
- [29] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS), pp. 908–918, 2017.
- [30] B. Ravindran and A. G. Barto, “Model minimization in hierarchical reinforcement learning,” in International Symposium on Abstraction, Reformulation, and Approximation, pp. 196–211, Springer, 2002.
- [31] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural networks, vol. 4, no. 2, pp. 251–257, 1991.
- [32] E. Taniguchi, R. G. Thompson, and T. Yamada, “Emerging techniques for enhancing the practical application of city logistics models,” Procedia-Social and Behavioral Sciences, vol. 39, pp. 3–18, 2012.
- [33] L. Cohen and A. Young, Multisourcing: Moving beyond outsourcing to achieve growth and agility. Harvard Business Press, 2006.
- [34] E. Mohebbi, “A replenishment model for the supply-uncertainty problem,” International Journal of Production Economics, vol. 87, pp. 25–37, 2004.
- [35] S. S. Ahiska, S. R. Appaji, R. E. King, and D. P. Warsing Jr, “A Markov decision process-based policy characterization approach for a stochastic inventory control problem with unreliable sourcing,” International Journal of Production Economics, vol. 144, no. 2, pp. 485–496, 2013.
- [36] Y. Shen, Risk sensitive Markov decision processes. PhD thesis, 2015.
- [37] E. Hadoux, Markovian sequential decision-making in non-stationary environments: application to argumentative debates. PhD thesis, UPMC, Sorbonne Universites CNRS, 2015.
- [38] S. P.-M. Choi, N. L. Zhang, and D.-Y. Yeung, “Solving hidden-mode Markov decision problems,” in AISTATS, 2001.
- [39] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A Lyapunov-based approach to safe reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 8092–8101, 2018.
- [40] P. Geibel and F. Wysotzki, “Risk-sensitive reinforcement learning applied to control under constraints,” Journal of Artificial Intelligence Research, vol. 24, pp. 81–108, 2005.
- [41] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” arXiv preprint arXiv:1806.06923, 2018.
- [42] S. Zhang, B. Mavrin, H. Yao, L. Kong, and B. Liu, “QUOTA: The quantile option architecture for reinforcement learning,” arXiv preprint arXiv:1811.02073, 2018.