
Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach

Abstract

In this paper we address the problem of decision making within a Markov decision process (MDP) framework where risk and modeling errors are taken into account. Our approach is to minimize a risk-sensitive conditional-value-at-risk (CVaR) objective, as opposed to a standard risk-neutral expectation. We refer to such a problem as a CVaR MDP. Our first contribution is to show that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget. This result, which is of independent interest, motivates CVaR MDPs as a unifying framework for risk-sensitive and robust decision making. Our second contribution is to present an approximate value-iteration algorithm for CVaR MDPs and analyze its convergence rate. To our knowledge, this is the first solution algorithm for CVaR MDPs that enjoys error guarantees. Finally, we present results from numerical experiments that corroborate our theoretical findings and show the practicality of our approach.


1 Introduction

Decision making within the Markov decision process (MDP) framework typically involves the minimization of a risk-neutral performance objective, namely the expected total discounted cost [Bertsekas(2012)]. This approach, while very popular, natural, and attractive from a computational standpoint, neither takes into account the variability of the cost (i.e., fluctuations around the mean), nor its sensitivity to modeling errors, which may significantly affect overall performance [Mannor et al.(2007)Mannor, Simester, Sun, and Tsitsiklis]. Risk-sensitive MDPs [Howard and Matheson(1972)] address the first aspect by replacing the risk-neutral expectation with a risk-measure of the total discounted cost, such as variance, Value-at-Risk (VaR), or Conditional-VaR (CVaR). Robust MDPs [Nilim and El Ghaoui(2005)], on the other hand, address the second aspect by defining a set of plausible MDP parameters, and optimizing decisions with respect to the worst-case scenario.

In this work we consider risk-sensitive MDPs with a CVaR objective, referred to as CVaR MDPs. CVaR [Artzner et al.(1999)Artzner, Delbaen, Eber, and Heath, Rockafellar and Uryasev(2000)] is a risk-measure that is rapidly gaining popularity in various engineering applications, e.g., finance, due to its favorable computational properties [Artzner et al.(1999)Artzner, Delbaen, Eber, and Heath] and superior ability to safeguard a decision maker from the “outcomes that hurt the most” [Serraino and Uryasev(2013)]. In this paper, by relating risk to robustness, we derive a novel result that further motivates the usage of a CVaR objective in a decision-making context. Specifically, we show that the CVaR of a discounted cost in an MDP is equivalent to the expected value of the same discounted cost in the presence of worst-case perturbations of the MDP parameters (specifically, transition probabilities), provided that such perturbations are within a certain error budget. This result suggests CVaR MDP as a method for decision making under both cost variability and model uncertainty, motivating it as a unified framework for planning under uncertainty.

Literature review: Risk-sensitive MDPs have been studied for over four decades, with earlier efforts focusing on exponential utility [Howard and Matheson(1972)], mean-variance [Sobel(1982)], and percentile risk criteria [Filar et al.(1995)Filar, Krass, and Ross]. Recently, for the reasons explained above, several authors have investigated CVaR MDPs [Rockafellar and Uryasev(2000)]. Specifically, in [Borkar and Jain(2014)], the authors propose a dynamic programming algorithm for finite-horizon risk-constrained MDPs where risk is measured according to CVaR. The algorithm is proven to asymptotically converge to an optimal risk-constrained policy. However, the algorithm involves computing integrals over continuous variables (Algorithm 1 in [Borkar and Jain(2014)]) and, in general, its implementation appears particularly difficult. In [Bäuerle and Ott(2011)], the authors investigate the structure of CVaR optimal policies and show that a Markov policy is optimal on an augmented state space, where the additional (continuous) state variable is represented by the running cost. In [Haskell and Jain(2014)], the authors leverage this result to design an algorithm for CVaR MDPs that relies on discretizing occupation measures in the augmented-state MDP. This approach, however, involves solving a non-convex program via a sequence of linear-programming approximations, which can only be shown to converge asymptotically. A different approach is taken by [Chow and Ghavamzadeh(2014)] and [Tamar et al.(2015)Tamar, Glassner, and Mannor], which consider a finite-dimensional parameterization of control policies, and show that a CVaR MDP can be optimized to a local optimum using stochastic gradient descent (policy gradient). A recent result by Pflug and Pichler [Pflug and Pichler(2012)] showed that CVaR MDPs admit a dynamic programming formulation by using a state-augmentation procedure different from the one in [Bäuerle and Ott(2011)]. The augmented state is also continuous, making the design of a solution algorithm challenging.

Contributions: The contribution of this paper is twofold. First, as discussed above, we provide a novel interpretation for CVaR MDPs in terms of robustness to modeling errors. This result is of independent interest and further motivates the usage of CVaR MDPs for decision making under uncertainty. Second, we provide a new optimization algorithm for CVaR MDPs, which leverages the state-augmentation procedure introduced by Pflug and Pichler [Pflug and Pichler(2012)]. We overcome the aforementioned computational challenges (due to the continuous augmented state) by designing an algorithm that merges approximate value iteration [Bertsekas(2012)] with linear interpolation. Remarkably, we are able to provide explicit error bounds and convergence rates based on contraction-style arguments. In comparison to the algorithms in [Borkar and Jain(2014), Haskell and Jain(2014), Chow and Ghavamzadeh(2014), Tamar et al.(2015)Tamar, Glassner, and Mannor], our approach leads to finite-time error guarantees, with respect to the globally optimal policy. In addition, our algorithm is significantly simpler than previous methods, and calculates the optimal policy for all CVaR confidence intervals and initial states simultaneously. The practicality of our approach is demonstrated in numerical experiments involving planning a path on a grid with thousands of states. To the best of our knowledge, this is the first algorithm to compute globally-optimal policies for non-trivial CVaR MDPs.

Organization: This paper is structured as follows. In Section 2 we provide background on CVaR and MDPs, state the problem we wish to solve (i.e., CVaR MDPs), and motivate the CVaR MDP formulation by establishing a novel relation between CVaR and model perturbations. Section 3 provides the basis for our solution algorithm, based on a Bellman-style equation for the CVaR. Then, in Section 4 we present our algorithm and correctness analysis. In Section 5 we evaluate our approach via numerical experiments. Finally, in Section 6, we draw some conclusions and discuss directions for future work.

2 Preliminaries, Problem Formulation, and Motivation

2.1 Conditional Value-at-Risk

Let $Z$ be a bounded-mean random variable, i.e., $\mathbb{E}[|Z|] < \infty$, on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with cumulative distribution function $F(z) = \mathbb{P}(Z \leq z)$. In this paper we interpret $Z$ as a cost. The value-at-risk (VaR) at confidence level $\alpha \in (0, 1]$ is the $1-\alpha$ quantile of $Z$, i.e., $\text{VaR}_\alpha(Z) = \min\{z \mid F(z) \geq 1 - \alpha\}$. The conditional value-at-risk (CVaR) at confidence level $\alpha$ is defined as [Rockafellar and Uryasev(2000)]:

$$\text{CVaR}_\alpha(Z) := \min_{w \in \mathbb{R}} \left\{ w + \frac{1}{\alpha}\, \mathbb{E}\big[(Z - w)^+\big] \right\}, \qquad (1)$$

where $(x)^+ = \max(x, 0)$ represents the positive part of $x$. If there is no probability atom at $\text{VaR}_\alpha(Z)$, it is well known that $\text{CVaR}_\alpha(Z) = \mathbb{E}\big[Z \mid Z \geq \text{VaR}_\alpha(Z)\big]$. Therefore, $\text{CVaR}_\alpha(Z)$ may be interpreted as the expected value of $Z$, conditioned on the $\alpha$-portion of the tail distribution. It is well known that $\text{CVaR}_\alpha(Z)$ is decreasing in $\alpha$, equals $\mathbb{E}[Z]$ for $\alpha = 1$, and tends to $\max(Z)$ as $\alpha \downarrow 0$. During the last decade, the CVaR risk-measure has gained popularity in financial applications, among others. It is especially useful for controlling rare, but potentially disastrous events, which occur beyond the quantile and are neglected by the VaR [Serraino and Uryasev(2013)]. Furthermore, CVaR enjoys desirable axiomatic properties, such as coherence [Artzner et al.(1999)Artzner, Delbaen, Eber, and Heath]. We refer to [Uryasev et al.(2010)Uryasev, Sarykalin, Serraino, and Kalinchenko] for further motivation about CVaR and a comparison with other risk measures such as VaR.
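To make definition (1) concrete, here is a short numerical sketch (added for illustration, not part of the original paper): it estimates VaR and CVaR of a sampled cost via the minimization in (1) (by a simple grid search over $w$) and checks the tail-expectation interpretation. The lognormal cost model and the confidence levels are arbitrary choices.

```python
import numpy as np

def var_cvar(costs, alpha):
    """Empirical VaR and CVaR at confidence level alpha (fraction of worst outcomes)."""
    costs = np.sort(costs)
    # VaR_alpha is the (1 - alpha)-quantile of the cost distribution.
    var = np.quantile(costs, 1.0 - alpha)
    # Rockafellar-Uryasev formula (1): CVaR_alpha = min_w { w + E[(Z - w)^+] / alpha },
    # approximated here by a grid search over w.
    ws = np.linspace(costs.min(), costs.max(), 2001)
    cvar = min(w + np.mean(np.maximum(costs - w, 0.0)) / alpha for w in ws)
    return var, cvar

rng = np.random.default_rng(0)
z = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # heavy-tailed cost samples
for alpha in (1.0, 0.1, 0.01):
    var, cvar = var_cvar(z, alpha)
    tail_mean = z[z >= var].mean()                      # average of the alpha-tail
    print(f"alpha={alpha:5.2f}  VaR={var:6.2f}  CVaR={cvar:6.2f}  tail mean={tail_mean:6.2f}")
```

For $\alpha = 1$ the CVaR coincides with the mean, while for small $\alpha$ it approaches the worst observed cost, matching the properties listed above.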

A useful property of CVaR, which we exploit in this paper, is its alternative dual representation [Artzner et al.(1999)Artzner, Delbaen, Eber, and Heath]:

$$\text{CVaR}_\alpha(Z) = \max_{\xi \in \mathcal{U}_{\text{CVaR}}(\alpha, \mathbb{P})} \mathbb{E}_\xi[Z], \qquad (2)$$

where $\mathbb{E}_\xi[Z]$ denotes the $\xi$-weighted expectation of $Z$, and the risk envelope $\mathcal{U}_{\text{CVaR}}$ is given by
$$\mathcal{U}_{\text{CVaR}}(\alpha, \mathbb{P}) = \left\{ \xi \;:\; \xi(\omega) \in \Big[0, \tfrac{1}{\alpha}\Big],\; \int_{\omega \in \Omega} \xi(\omega)\, \mathbb{P}(\omega)\, d\omega = 1 \right\}.$$
Thus, the CVaR of a random variable $Z$ may be interpreted as the worst-case expectation of $Z$, under a perturbed distribution $\xi \mathbb{P}$.
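For a discrete distribution, the worst-case reweighting in (2) simply assigns the maximum admissible weight $1/\alpha$ to the largest costs until the probability mass is exhausted. The following sketch (added here for illustration, not taken from the paper) computes this greedy maximizer directly:

```python
import numpy as np

def cvar_dual(costs, probs, alpha):
    """Worst-case reweighted expectation over the CVaR risk envelope:
    maximize sum_i xi_i * p_i * z_i  s.t.  0 <= xi_i <= 1/alpha,  sum_i xi_i * p_i = 1.
    The maximizer greedily assigns xi_i = 1/alpha to the largest costs."""
    order = np.argsort(costs)[::-1]          # largest cost first
    xi = np.zeros_like(probs)
    budget = 1.0                              # remaining probability mass to allocate
    for i in order:
        w = min(probs[i] / alpha, budget)     # mass xi_i * p_i, capped by envelope and budget
        xi[i] = w / probs[i]
        budget -= w
        if budget <= 1e-12:
            break
    return np.dot(xi * probs, costs), xi

costs = np.array([0.0, 1.0, 5.0, 20.0])
probs = np.array([0.4, 0.3, 0.2, 0.1])
for alpha in (1.0, 0.3, 0.1):
    cvar, xi = cvar_dual(costs, probs, alpha)
    print(f"alpha={alpha:4.1f}  CVaR={cvar:6.2f}  xi={np.round(xi, 2)}")
```

For example, with $\alpha = 0.3$ the worst $30\%$ of the mass (the costs $20$ and part of $5$) receives weight $1/\alpha$, yielding $\text{CVaR}_{0.3} = 10$, which matches the tail-average interpretation.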

In this paper, we are interested in the CVaR of the total discounted cost in a sequential decision-making setting, as discussed next.

2.2 Markov Decision Processes

An MDP is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, C, P, \gamma, x_0)$, where $\mathcal{X}$ and $\mathcal{A}$ are finite state and action spaces; $C(x, a)$ is a bounded deterministic cost; $P(\cdot \mid x, a)$ is the transition probability distribution; $\gamma \in [0, 1)$ is the discounting factor; and $x_0$ is the initial state. (Our results easily generalize to random initial states and random costs.)

Let the space of admissible histories up to time $t$ be $H_t = H_{t-1} \times \mathcal{A} \times \mathcal{X}$, for $t \geq 1$, and $H_0 = \mathcal{X}$. A generic element $h_t \in H_t$ is of the form $h_t = (x_0, a_0, \ldots, x_{t-1}, a_{t-1}, x_t)$. Let $\Pi_{H,t}$ be the set of all deterministic history-dependent policies with the property that at each time $k \leq t$ the control is a function of $h_k$. In other words, $\Pi_{H,t} = \big\{ \{\mu_0 : H_0 \to \mathcal{A},\, \mu_1 : H_1 \to \mathcal{A},\, \ldots,\, \mu_t : H_t \to \mathcal{A}\} \big\}$. We also let $\Pi_H$ be the set of all history-dependent policies.

2.3 Problem Formulation

Let the sequence of random variables $\{C(x_t, a_t)\}_{t \geq 0}$ denote the stage-wise costs observed along a state/control trajectory in the MDP model, and let $C_{0,T} = \sum_{t=0}^{T} \gamma^t C(x_t, a_t)$ denote the total discounted cost up to time $T$. The risk-sensitive discounted-cost problem we wish to address is as follows:

$$\min_{\mu \in \Pi_H}\; \text{CVaR}_\alpha \Big( \lim_{T \to \infty} C_{0,T} \,\Big|\, x_0, \mu \Big), \qquad (3)$$

where $\mu = \{\mu_0, \mu_1, \ldots\}$ is the policy sequence with actions $a_t = \mu_t(h_t)$ for $t \in \{0, 1, \ldots\}$, and $\alpha \in (0, 1]$ is the CVaR confidence level. We refer to problem (3) as a CVaR MDP. (One may also consider a related formulation combining mean and CVaR, the details of which are presented in the supplementary material.)

The problem formulation in (3) directly addresses the aspect of risk sensitivity, as demonstrated by the numerous applications of CVaR optimization in finance (see, e.g., [Rockafellar et al.(2006)Rockafellar, Uryasev, and Zabarankin, Iyengar and Ma(2013), Dowd(2007)]) and the recent approaches for CVaR optimization in MDPs [Borkar and Jain(2014), Haskell and Jain(2014), Chow and Ghavamzadeh(2014), Tamar et al.(2015)Tamar, Glassner, and Mannor]. In the following, we show a new result providing additional motivation for CVaR MDPs, from the point of view of robustness to modeling errors.

2.4 Motivation - Robustness to Modeling Errors

We show a new result relating the CVaR objective in (3) to the worst-case expected discounted-cost in presence of worst-case perturbations of the MDP parameters, where the perturbations are budgeted according to the “number of things that can go wrong.” Thus, by minimizing CVaR, the decision maker also guarantees robustness of the policy.

Consider a trajectory $h_T = (x_0, a_0, \ldots, x_T)$ in a finite-horizon MDP problem with transitions $P_t(x_{t+1} \mid x_t, a_t)$. We explicitly denote the time index of the transition matrices for reasons that will become clear shortly. The total probability of the trajectory is $\mathbb{P}(h_T) = P_0(x_1 \mid x_0, a_0) \cdots P_{T-1}(x_T \mid x_{T-1}, a_{T-1})$, and we let $C(h_T) = \sum_{t=0}^{T} \gamma^t C(x_t, a_t)$ denote its discounted cost, as defined above.

We consider an adversarial setting, where an adversary is allowed to change the transition probabilities at each stage, under some budget constraints. We will show that, for a specific budget and perturbation structure, the expected cost under the worst-case perturbation is equivalent to the CVaR of the cost. Thus, we shall establish that, in this perspective, being risk sensitive is equivalent to being robust against model perturbations.

For each stage $1 \leq t \leq T$, consider a perturbed transition matrix $\hat{P}_t = P_t \circ \delta_t$, where $\delta_t$ is a multiplicative probability perturbation and $\circ$ is the Hadamard product, under the condition that $\hat{P}_t$ is a stochastic matrix. Let $\Delta_t$ denote the set of perturbation matrices that satisfy this condition, and let $\Delta = \Delta_1 \times \cdots \times \Delta_T$ denote the set of all possible perturbations to the trajectory distribution.

We now impose a budget constraint on the perturbations as follows. For some budget $\eta \geq 1$, we consider the constraint

$$\delta_1(x_1 \mid x_0, a_0)\, \delta_2(x_2 \mid x_1, a_1) \cdots \delta_T(x_T \mid x_{T-1}, a_{T-1}) \;\leq\; \eta, \qquad \forall h_T \in H_T. \qquad (4)$$

Essentially, the product in Eq. (4) states that the worst cannot happen at each time. Instead, the perturbation budget has to be split (multiplicatively) along the trajectory. We note that Eq. (4) is in fact a constraint on the perturbation matrices, and we denote by $\Delta_\eta \subset \Delta$ the set of perturbations that satisfy this constraint with budget $\eta$. The following result shows an equivalence between the CVaR and the worst-case expected loss.
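As a small worked example (added here for illustration), consider a two-stage trajectory ($T = 2$) with budget $\eta = 4$. Constraint (4) reads $\delta_1(x_1 \mid x_0, a_0)\,\delta_2(x_2 \mid x_1, a_1) \leq 4$, so the adversary may spend the whole budget at one stage or split it across stages, but not spend it fully at both:

$$(\delta_1, \delta_2) = (4, 1)\ \text{admissible}, \qquad (\delta_1, \delta_2) = (2, 2)\ \text{admissible}, \qquad (\delta_1, \delta_2) = (4, 4)\ \text{inadmissible, since } 16 > 4.$$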

Proposition 2.4 (Interpretation of CVaR as a Robustness Measure)

It holds

$$\text{CVaR}_{1/\eta}\big(C(h_T)\big) \;=\; \max_{\delta \in \Delta_\eta} \mathbb{E}_\delta\big[C(h_T)\big], \qquad (5)$$

where $\mathbb{E}_\delta[\cdot]$ denotes expectation with respect to a Markov chain with transitions $\hat{P}_t = P_t \circ \delta_t$.

The proof of Proposition 2.4 is in the supplementary material. It is instructive to compare Proposition 2.4 with the dual representation of CVaR in (2). Note, in particular, that the perturbation budget in Proposition 2.4 has a temporal structure, which constrains the adversary from choosing the worst perturbation at each time step.

Remark.

An equivalence between robustness and risk-sensitivity was previously suggested by Osogami [Osogami(2012)]. In that study, the iterated (dynamic) coherent risk was shown to be equivalent to a robust MDP [Iyengar(2005)] with a rectangular uncertainty set. The iterated risk (and, correspondingly, the rectangular uncertainty set) is very conservative [Xu and Mannor(2006)], in the sense that the worst can happen at each time step. In contrast, the perturbations considered here are much less conservative. In general, solving robust MDPs without the rectangularity assumption is NP-hard. Nevertheless, Mannor et al. [Mannor et al.(2012)Mannor, Mebel, and Xu] showed that, for cases where the number of perturbations to the parameters along a trajectory is upper bounded (budget-constrained perturbation), the corresponding robust MDP problem is tractable. Analogously to the constraint set (1) in [Mannor et al.(2012)Mannor, Mebel, and Xu], the perturbation set in Proposition 2.4 limits the total number of log-perturbations along a trajectory. Accordingly, we shall later see that optimizing problem (3) with perturbation structure (4) is indeed also tractable.

The next section provides the fundamental theoretical ideas behind our approach to the solution of (3).

3 Bellman Equation for CVaR

In this section, by leveraging a recent result from [Pflug and Pichler(2012)], we present a dynamic programming (DP) formulation for the CVaR MDP problem in (3). As we shall see, the value function in this formulation depends on both the state and the CVaR confidence level . We then establish important properties of such DP formulation, which will later enable us to derive an efficient DP-based approximate solution algorithm and provide correctness guarantees on the approximation error. All proofs are presented in the supplementary material.

Our starting point is a recursive decomposition of CVaR, whose proof is detailed in Theorem 10 of [Pflug and Pichler(2012)].

Theorem 3.1 (CVaR Decomposition, [Pflug and Pichler(2012)]). For any $t \geq 0$, denote by $\mathcal{Z} = (Z_{t+1}, Z_{t+2}, \ldots)$ the cost sequence from time $t+1$ onwards. The conditional CVaR under policy $\mu$, i.e., $\text{CVaR}_\alpha(\mathcal{Z} \mid h_t, \mu)$, obeys the following decomposition:

$$\text{CVaR}_\alpha(\mathcal{Z} \mid h_t, \mu) \;=\; \max_{\xi \in \mathcal{U}_{\text{CVaR}}(\alpha,\, P(\cdot \mid x_t, a_t))} \mathbb{E}\Big[ \xi(x_{t+1}) \cdot \text{CVaR}_{\alpha\,\xi(x_{t+1})}\big(\mathcal{Z} \mid h_{t+1}, \mu\big) \,\Big|\, h_t, \mu \Big],$$

where $a_t$ is the action induced by policy $\mu_t(h_t)$, and the expectation is with respect to $x_{t+1}$. Theorem 3.1 concerns a fixed policy $\mu$; we now extend it to a general DP formulation. Note that in the recursive decomposition in Theorem 3.1 the right-hand side involves CVaR terms with a confidence level different from that on the left-hand side. Accordingly, we augment the state space $\mathcal{X}$ with an additional continuous state $y \in \mathcal{Y} = (0, 1]$, which corresponds to the confidence level. For any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, the value function for the augmented state $(x, y)$ is defined as:

$$V(x, y) = \min_{\mu \in \Pi_H} \text{CVaR}_y\Big( \lim_{T \to \infty} C_{0,T} \,\Big|\, x_0 = x, \mu \Big).$$

Similar to standard DP, it is convenient to work with operators defined on the space of value functions [Bertsekas(2012)]. In our case, Theorem 3.1 leads to the following definition of the CVaR Bellman operator $\mathbf{T}$:

$$\mathbf{T}[V](x, y) = \min_{a \in \mathcal{A}} \left[ C(x, a) + \gamma \max_{\xi \in \mathcal{U}_{\text{CVaR}}(y,\, P(\cdot \mid x, a))} \sum_{x' \in \mathcal{X}} \xi(x')\, V\big(x', y\,\xi(x')\big)\, P(x' \mid x, a) \right]. \qquad (6)$$

We now establish several useful properties of the Bellman operator $\mathbf{T}$.

Lemma 3.2 (Properties of the CVaR Bellman Operator). The Bellman operator $\mathbf{T}$ has the following properties:

  1. (Contraction.) $\|\mathbf{T}[V_1] - \mathbf{T}[V_2]\|_\infty \leq \gamma\, \|V_1 - V_2\|_\infty$, where $\|f\|_\infty = \sup_{x, y} |f(x, y)|$.

  2. (Concavity preserving in $y$.) For any $x \in \mathcal{X}$, suppose $y\,V(x, y)$ is concave in $y \in \mathcal{Y}$. Then the maximization problem in (6) is concave. Furthermore, $y\,\mathbf{T}[V](x, y)$ is concave in $y$.

The first property in Lemma 3.2 is similar to standard DP [Bertsekas(2012)], and is instrumental to the design of a converging value-iteration approach. The second property is nonstandard and specific to our approach. It will be used to show that the computation of value-iteration updates involves concave, and therefore tractable, optimization problems. Furthermore, it will be used to show that a linear interpolation of $y\,V(x, y)$ in the augmented state $y$ has a bounded error.

Equipped with the results in Theorem 3.1 and Lemma 3.2, we can now show that the fixed-point solution of $\mathbf{T}[V] = V$ is unique, and equals the solution of the CVaR MDP problem (3) with $x_0 = x$ and confidence level $\alpha = y$.

Theorem 3.3 (Optimality Condition). For any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, the solution to $\mathbf{T}[V](x, y) = V(x, y)$ is unique, and equals $V^*(x, y) = \min_{\mu \in \Pi_H} \text{CVaR}_y\big( \lim_{T \to \infty} C_{0,T} \mid x_0 = x, \mu \big)$.

Next, we show that the optimal value of the CVaR MDP problem (3) can be attained by a stationary Markov policy, defined as a greedy policy with respect to the value function $V^*(x, y)$. Thus, while the original problem is defined over the intractable space of history-dependent policies, a stationary Markov policy (over the augmented state space) is optimal, and can be readily derived from $V^*(x, y)$. Furthermore, an optimal history-dependent policy can be readily obtained from an (augmented) optimal Markov policy according to the following theorem.

Theorem 3.4 (Optimal Policies). Let $\pi_H = \{\mu_0, \mu_1, \ldots\} \in \Pi_H$ be a history-dependent policy recursively defined as:

$$\mu_k(h_k) = u^*(x_k, y_k), \qquad \forall k \geq 0, \qquad (7)$$

with initial conditions $x_0$ and $y_0 = \alpha$, and state transitions

$$x_k \sim P\big(\cdot \mid x_{k-1}, u^*(x_{k-1}, y_{k-1})\big), \qquad y_k = y_{k-1}\, \xi^*_{x_{k-1},\, y_{k-1},\, u^*}(x_k), \qquad (8)$$

where the stationary Markovian policy $u^*(x, y)$ and risk factor $\xi^*_{x, y, u^*}(\cdot)$ are the solution to the min-max optimization problem in the CVaR Bellman operator $\mathbf{T}[V^*](x, y)$. Then, $\pi_H$ is an optimal policy for problem (3) with initial state $x_0$ and CVaR confidence level $\alpha$.
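The policy construction in Theorem 3.4 is easy to execute once $u^*$ and $\xi^*$ are available: at each step the agent applies the greedy action and multiplies the current confidence level by the realized risk factor, as in Eqs. (7)-(8). The following Python sketch (added for illustration, not part of the original paper) shows this rollout; `u_star` and `xi_star` are hypothetical lookups assumed to be produced by a CVaR value-iteration solver.

```python
import numpy as np

def rollout(P, u_star, xi_star, x0, alpha, horizon, rng):
    """Execute the CVaR-optimal policy of Theorem 3.4 on the augmented state (x, y).

    P[x][a]          : transition distribution over next states (1-D array over X)
    u_star(x, y)     : greedy action from the min-max problem in the Bellman operator
    xi_star(x, y, a) : risk-factor vector over next states (hypothetical solver output)
    """
    x, y = x0, alpha
    trajectory = [(x, y)]
    for _ in range(horizon):
        a = u_star(x, y)
        xi = xi_star(x, y, a)                      # worst-case reweighting chosen at (x, y, a)
        x_next = rng.choice(len(P[x][a]), p=P[x][a])
        y = y * xi[x_next]                         # confidence-level update, Eq. (8)
        x = x_next
        trajectory.append((x, y))
    return trajectory
```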

Theorems 3.3 and 3.4 suggest that a value-iteration DP method [Bertsekas(2012)] can be used to solve the CVaR MDP problem (3). Let an initial value-function guess $V_0 : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be chosen arbitrarily. Value iteration proceeds recursively as follows:

$$V_{k+1}(x, y) = \mathbf{T}[V_k](x, y), \qquad \forall (x, y) \in \mathcal{X} \times \mathcal{Y},\; k \in \{0, 1, \ldots\}. \qquad (9)$$

Specifically, by combining the contraction property in Lemma 3.2 and the uniqueness of the fixed-point solution from Theorem 3.3, one concludes that $\lim_{k \to \infty} V_k(x, y) = V^*(x, y)$. By selecting $x = x_0$ and $y = \alpha$, one immediately obtains the optimal value of problem (3). Furthermore, an optimal policy may be derived from $V^*$ according to the policy construction procedure in Theorem 3.4.

Unfortunately, while value iteration is conceptually appealing, its direct implementation in our setting is generally impractical since, e.g., the augmented state $y$ is continuous. In the following, we pursue an approximation to the value iteration algorithm (9), based on a linear interpolation scheme for $y\,V(x, y)$.

4 Value Iteration with Linear Interpolation

In this section we present an approximate DP algorithm for solving CVaR MDPs, based on the theoretical results of Section 3. The value iteration algorithm in Eq. (9) presents two main implementation challenges. The first is due to the fact that the augmented state $y$ is continuous. We handle this challenge by using interpolation, and exploit the concavity of $y\,V(x, y)$ to bound the error introduced by this procedure. The second challenge stems from the fact that applying $\mathbf{T}$ involves maximizing over $\xi$. Our strategy is to exploit the concavity of the maximization problem to guarantee that such optimization can indeed be performed effectively.

As discussed, our approach relies on the fact that the Bellman operator $\mathbf{T}$ preserves concavity, as established in Lemma 3.2. Accordingly, we require the following assumption on the initial guess $V_0$.

Assumption 1

The initial value-function guess $V_0$ satisfies the following properties: 1) $y\,V_0(x, y)$ is concave in $y \in \mathcal{Y}$, and 2) $V_0(x, y)$ is continuous in $y \in \mathcal{Y}$ for any $x \in \mathcal{X}$.

Assumption 1 may easily be satisfied, for example, by choosing $V_0(x, y) = \text{CVaR}_y(Z \mid x)$, where $Z$ is any arbitrary bounded random variable.

Algorithm 1: CVaR Value Iteration with Linear Interpolation

1: Given:

  • $N(x)$ interpolation points $\mathbf{Y}(x) = \{y_1, \ldots, y_{N(x)}\} \in [0, 1]^{N(x)}$ for every $x \in \mathcal{X}$, with $y_i < y_{i+1}$, $y_1 = 0$, and $y_{N(x)} = 1$.

  • Initial value function $V_0(x, y)$ that satisfies Assumption 1.

2: For $t = 1, 2, \ldots$

  • For each $x \in \mathcal{X}$ and each $y_i \in \mathbf{Y}(x)$, update the value function estimate as follows: $V_t(x, y_i) = \mathbf{T}_{\mathcal{I}}[V_{t-1}](x, y_i)$.

3: Set the converged value iteration estimate as $\widehat{V}^*(x, y_i)$, for any $x \in \mathcal{X}$ and $y_i \in \mathbf{Y}(x)$.

As stated earlier, a key difficulty in applying value iteration (9) is that, for each state $x \in \mathcal{X}$, the Bellman operator has to be calculated for each $y \in \mathcal{Y}$, and $\mathcal{Y}$ is continuous. As an approximation, we propose to calculate the Bellman operator only for a finite set of values $y$, and interpolate the value function in between such interpolation points.

Formally, let $N(x)$ denote the number of interpolation points at state $x$. For every $x \in \mathcal{X}$, denote by $\mathbf{Y}(x) = \{y_1, y_2, \ldots, y_{N(x)}\} \in [0, 1]^{N(x)}$ the set of interpolation points. We denote by $\mathcal{I}_x[V](y)$ the linear interpolation of the function $y\,V(x, y)$ on these points, i.e.,

$$\mathcal{I}_x[V](y) = y_i\, V(x, y_i) + \frac{y_{i+1}\, V(x, y_{i+1}) - y_i\, V(x, y_i)}{y_{i+1} - y_i}\,(y - y_i),$$

where $y_i = \max\{y' \in \mathbf{Y}(x) : y' \leq y\}$ is the largest interpolation point not exceeding $y$. The interpolation of $y\,V(x, y)$ instead of $V(x, y)$ is key to our approach. The motivation is twofold: first, it can be shown [Rockafellar and Uryasev(2000)] that for a discrete random variable $Z$, $y\,\text{CVaR}_y(Z)$ is piecewise linear in $y$. Second, one can show that the Lipschitzness of $y\,V(x, y)$ is preserved during value iteration, and exploit this fact to bound the linear interpolation error.
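The interpolation itself is straightforward; the following minimal Python sketch (added for illustration, with an arbitrary grid and arbitrary test values) interpolates the products $y_i\,V(x, y_i)$ rather than the values $V(x, y_i)$ themselves:

```python
import numpy as np

def make_interp(ys, vs):
    """Return I_x[V](y): piecewise-linear interpolation of y * V(x, y) on the grid ys."""
    ys = np.asarray(ys, dtype=float)
    yv = ys * np.asarray(vs, dtype=float)        # interpolate y * V(x, y), not V(x, y)
    return lambda y: np.interp(y, ys, yv)

def value_at(interp, y):
    """Recover the interpolated value V(x, y) = I_x[V](y) / y for y > 0."""
    return interp(y) / y

# Log-spaced confidence grid on (0, 1], plus the endpoint y = 0.
ys = np.concatenate(([0.0], np.logspace(-3, 0, 10)))
vs = 10.0 / (1.0 + 50.0 * ys)                    # test values; here y * V(x, y) is concave in y
interp = make_interp(ys, vs)
print(value_at(interp, 0.05), value_at(interp, 0.5))
```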

We now define the interpolated Bellman operator $\mathbf{T}_{\mathcal{I}}$ as follows:

$$\mathbf{T}_{\mathcal{I}}[V](x, y) = \min_{a \in \mathcal{A}} \left[ C(x, a) + \gamma \max_{\xi \in \mathcal{U}_{\text{CVaR}}(y,\, P(\cdot \mid x, a))} \sum_{x' \in \mathcal{X}} \frac{\mathcal{I}_{x'}[V]\big(y\,\xi(x')\big)}{y}\, P(x' \mid x, a) \right]. \qquad (10)$$
Remark.

Notice that by L'Hôpital's rule one has $\lim_{y \downarrow 0} \mathcal{I}_x[V](y)/y = V(x, 0)$. This implies that at $y = 0$ the interpolated Bellman operator is equivalent to the original Bellman operator, i.e., $\mathbf{T}_{\mathcal{I}}[V](x, 0) = \mathbf{T}[V](x, 0)$.

Algorithm 1 presents CVaR value iteration with linear interpolation. The only difference between this algorithm and standard value iteration (9) is the linear interpolation procedure described above. In the following, we show that Algorithm 1 converges, and bound the error due to interpolation. We begin by showing that the useful properties established in Lemma 3.2 for the Bellman operator $\mathbf{T}$ extend to the interpolated Bellman operator $\mathbf{T}_{\mathcal{I}}$.

Lemma 4.1 (Properties of the Interpolated Bellman Operator). $\mathbf{T}_{\mathcal{I}}$ has the same properties as $\mathbf{T}$ in Lemma 3.2, namely 1) contraction and 2) concavity preservation.

Lemma 4.1 implies several important consequences for Algorithm 1. The first one is that the maximization problem in (10) is concave, and thus may be solved efficiently at each step. This guarantees that the algorithm is tractable. Second, the contraction property in Lemma 4.1 guarantees that Algorithm 1 converges, i.e., there exists a value function $\widehat{V}^*$ such that $\lim_{n \to \infty} \mathbf{T}_{\mathcal{I}}^n[V_0](x, y_i) = \widehat{V}^*(x, y_i)$. In addition, the convergence rate is geometric and equals $\gamma$.
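To see how the inner maximization in (10) can be solved in practice, the sketch below (added for illustration; it is not the paper's implementation, which used CPLEX) casts it as a linear program using SciPy. It assumes $y > 0$, an interpolation grid spanning $[0, 1]$, and the concavity of $y\,V(x, y)$ guaranteed by Lemma 4.1, so that each interpolated term equals the minimum of its linear pieces and can be captured exactly by a hypograph variable.

```python
import numpy as np
from scipy.optimize import linprog

def inner_max(y, p, ys, V):
    """Solve  max_xi  (1/y) * sum_x' p[x'] * I_{x'}[V](y * xi[x'])
       s.t.   0 <= xi[x'] <= 1/y,   sum_x' p[x'] * xi[x'] = 1,
    where I_{x'}[V] is the piecewise-linear interpolation of z * V(x', z) on the grid ys.

    y  : current confidence level (assumed > 0)
    p  : transition probabilities P(. | x, a), shape (n,)
    ys : interpolation grid 0 = y_1 < ... < y_N = 1, shape (m,)
    V  : value-function estimates V(x', y_i), shape (n, m)
    """
    p, ys = np.asarray(p, dtype=float), np.asarray(ys, dtype=float)
    n, m = V.shape
    yv = ys[None, :] * V                                  # y_i * V(x', y_i)
    slopes = np.diff(yv, axis=1) / np.diff(ys)[None, :]   # slope of each linear piece
    intercepts = yv[:, :-1] - slopes * ys[None, :-1]      # intercept of each piece

    # Variables z = [xi_1..xi_n, t_1..t_n]; maximize sum_x' p[x'] * t[x'] / y.
    c = np.concatenate([np.zeros(n), -p / y])             # linprog minimizes c @ z
    rows, rhs = [], []
    for i in range(n):                                    # hypograph constraints:
        for k in range(m - 1):                            # t_i <= slope * (y * xi_i) + intercept
            row = np.zeros(2 * n)
            row[n + i] = 1.0
            row[i] = -slopes[i, k] * y
            rows.append(row)
            rhs.append(intercepts[i, k])
    A_eq = np.concatenate([p, np.zeros(n)])[None, :]      # sum_x' p[x'] * xi[x'] = 1
    bounds = [(0.0, 1.0 / y)] * n + [(None, None)] * n    # xi in [0, 1/y], t free
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  A_eq=A_eq, b_eq=[1.0], bounds=bounds, method="highs")
    return -res.fun, res.x[:n]                            # optimal value and maximizer xi
```

The full update $\mathbf{T}_{\mathcal{I}}[V](x, y)$ is then obtained by evaluating $C(x, a) + \gamma$ times this inner value for every action $a$ and taking the minimum.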

The following theorem provides an error bound between approximate value iteration and exact value iteration (9) in terms of the interpolation resolution.

Theorem 4.2 (Convergence and Error Bound). Suppose the initial value function $V_0(x, y)$ satisfies Assumption 1 and let $\epsilon > 0$ be an error tolerance parameter. For any state $x \in \mathcal{X}$ and step $t \geq 0$, choose $y_2 > 0$ such that $V_t(x, y_2) \geq V_t(x, 0) - \epsilon$ and update the interpolation points according to the logarithmic rule: $y_{i+1} = \theta\, y_i$, $\forall i \geq 2$, with uniform constant $\theta > 1$. Then, Algorithm 1 has the following error bound:

and the following finite time convergence error bound:

Theorem 4.2 shows that 1) the interpolation-based value function is a conservative estimate for the optimal solution to problem (3); 2) the interpolation procedure is consistent, i.e., when the number of interpolation points is arbitrarily large (specifically, $\epsilon \to 0$ and $\theta \to 1$), the approximation error tends to zero; and 3) the approximation error bound scales with the log-difference of the interpolation points, i.e., with $\log\theta = \log y_{i+1} - \log y_i$, $\forall i \geq 2$.

For a pre-specified $\epsilon$, the condition $V_t(x, y_2) \geq V_t(x, 0) - \epsilon$ may be satisfied by a simple adaptive procedure for selecting the interpolation points $\mathbf{Y}(x)$. At each iteration $t$, after calculating $V_t$ in Algorithm 1, at each state $x$ in which the condition does not hold, add a new interpolation point $y_2'$, and additional points between $y_2'$ and $y_2$ such that the condition is maintained. Since all the additional points belong to the segment $[0, y_2]$, the linearly interpolated $V_t$ remains unchanged, and Algorithm 1 proceeds as is. For bounded costs and $\epsilon > 0$, the number of additional points required is bounded.

The full proof of Theorem 4.2 is detailed in the supplementary material; here we highlight the main ideas and challenges involved. In the first part of the proof we bound, for all $t \geq 0$, the Lipschitz constant of $y\,V_t(x, y)$ in $y$. The key to this result is to show that the Bellman operator $\mathbf{T}$ preserves the Lipschitz property for $y\,V_t(x, y)$. Using the Lipschitz bound and the concavity of $y\,V_t(x, y)$, we then bound the interpolation error $|\mathcal{I}_x[V_t](y) - y\,V_t(x, y)|$ for all $y$. The condition on $y_2$ is required for this bound to hold when $y \to 0$. Finally, we use this result to bound the distance between $\mathbf{T}_{\mathcal{I}}[V_t]$ and $\mathbf{T}[V_t]$. The results of Theorem 4.2 then follow from contraction arguments, similar to approximate dynamic programming [Bertsekas(2012)].
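Putting the pieces together, the overall structure of Algorithm 1 is a plain value-iteration loop over the finite grid of augmented states. The sketch below (added for illustration) makes this explicit; `bellman_update` is a hypothetical per-point routine implementing Eq. (10), e.g., the LP-based inner maximization sketched above followed by the minimization over actions.

```python
import numpy as np

def cvar_value_iteration(V0, ys, bellman_update, tol=1e-6, max_iters=10_000):
    """Algorithm 1: approximate value iteration on the grid X x Y.

    V0             : initial guess V_0(x, y_i), shape (|X|, |Y|), satisfying Assumption 1
    ys             : interpolation points y_1 = 0 < ... < y_N = 1, shape (|Y|,)
    bellman_update : hypothetical function (V, x_idx, y) -> T_I[V](x, y), per Eq. (10)
    """
    V = np.array(V0, dtype=float)
    for _ in range(max_iters):
        V_next = np.empty_like(V)
        for x_idx in range(V.shape[0]):
            for y_idx, y in enumerate(ys):
                V_next[x_idx, y_idx] = bellman_update(V, x_idx, y)
        # Geometric convergence follows from the contraction property in Lemma 4.1.
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V
```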

5 Experiments

We validate Algorithm 1 on a rectangular grid world, where states represent grid points on a 2D terrain map. An agent (e.g., a robotic vehicle) starts in a safe region and its objective is to travel to a given destination. At each time step the agent can move to any of its four neighboring states. Due to sensing and control noise, however, with probability $\delta$ a move to a random neighboring state occurs instead. Each move before reaching the destination incurs a stage-wise cost, to account for fuel usage. In between the starting point and the destination there are a number of obstacles that the agent should avoid. Hitting an obstacle incurs a penalty cost $M$ and terminates the mission. The objective is to compute a safe (i.e., obstacle-free) path that is fuel efficient.

For our experiments, we choose a grid-world with a total of 3,312 states (see Figure 1). The destination and the obstacles (plotted in yellow) are shown in the figure. By leveraging Theorem 4.2, we use log-spaced interpolation points for Algorithm 1 in order to achieve a small value-function error. We choose a discount factor corresponding to an effective horizon of $1/(1-\gamma) = 200$ steps. Furthermore, the penalty cost $M$ is chosen to trade off a high penalty for collisions against computational complexity (which increases as $M$ increases).

In Figure 1 we plot the value function for three different values of the CVaR confidence parameter $\alpha$, and the corresponding paths starting from the initial position. The first three plots in Figure 1 show that, by decreasing the confidence parameter $\alpha$, the average travel distance (and hence fuel consumption) slightly increases but the collision probability decreases, as expected. We next discuss robustness to modeling errors. We conducted simulations in which, with some probability, each obstacle position is perturbed in a random direction to one of the neighboring grid cells. This emulates, for example, measurement errors in the terrain map. We then trained both a risk-averse (low $\alpha$) and the risk-neutral ($\alpha = 1$) policy on the nominal (i.e., unperturbed) terrain map, and evaluated them on perturbed scenarios (a set of perturbed maps with repeated Monte Carlo evaluations of each). While the risk-neutral policy finds a shorter route on average (on successful runs), it is vulnerable to perturbations and fails more often. In contrast, the risk-averse policy chooses slightly longer routes (on successful runs), but is much more robust to model perturbations and fails far less often.

For the computation of Algorithm 1 we represented the concave piecewise linear maximization problem in (10) as a linear program, and concatenated several problems to reduce repeated overhead stemming from the initialization of the CPLEX linear programming solver. This resulted in a computation time on the order of two hours. We believe there is ample room for improvement, for example by leveraging parallelization and sampling-based methods. Overall, we believe our proposed approach is currently the most practical method available for solving CVaR MDPs (as a comparison, the recently proposed method in [Haskell and Jain(2014)] involves infinite dimensional optimization). The Matlab code used for the experiments is provided in the supplementary material.

Figure 1: Grid-world simulation. The left three plots show the value functions and corresponding paths for different CVaR confidence levels. The rightmost plot shows a cost histogram (for 400 Monte Carlo trials) for a risk-neutral policy and a CVaR policy with a low confidence level $\alpha$.

6 Conclusion

In this paper we presented an algorithm for CVaR MDPs, based on approximate value-iteration on an augmented state space. We established convergence of our algorithm, and derived finite-time error bounds. These bounds are useful to stop the algorithm at a desired error threshold.

In addition, we uncovered an interesting relationship between the CVaR of the total cost and the worst-case expected cost under adversarial model perturbations. In this formulation, the perturbations are correlated in time, and lead to a robustness framework significantly less conservative than the popular robust-MDP framework, where the uncertainty is temporally independent.

Collectively, our work suggests CVaR MDPs as a unifying and practical framework for computing control policies that are robust with respect to both stochasticity and model perturbations. Future work should address extensions to large state-spaces. We conjecture that a sampling-based approximate DP approach [Bertsekas(2012)] should be feasible since, as proven in this paper, the CVaR Bellman equation is contracting (as required by approximate DP methods).

Appendix A Proofs of Theoretical Results

A.1 Proof of Proposition 2.4

By definition, we have that

$$\mathbb{E}_\delta\big[C(h_T)\big] = \sum_{h_T \in H_T} \mathbb{P}(h_T)\, \delta_1(x_1 \mid x_0, a_0) \cdots \delta_T(x_T \mid x_{T-1}, a_{T-1})\, C(h_T).$$

Note that by definition of the set $\Delta_\eta$, for any $\delta \in \Delta_\eta$ we have that $\delta_1(x_1 \mid x_0, a_0) \cdots \delta_T(x_T \mid x_{T-1}, a_{T-1}) \in [0, \eta]$, and

$$\sum_{h_T \in H_T} \mathbb{P}(h_T)\, \delta_1(x_1 \mid x_0, a_0) \cdots \delta_T(x_T \mid x_{T-1}, a_{T-1}) = 1.$$

Thus, the cumulative perturbation $\xi(h_T) := \delta_1(x_1 \mid x_0, a_0) \cdots \delta_T(x_T \mid x_{T-1}, a_{T-1})$ belongs to the CVaR risk envelope $\mathcal{U}_{\text{CVaR}}(1/\eta, \mathbb{P})$, and

$$\max_{\delta \in \Delta_\eta} \mathbb{E}_\delta\big[C(h_T)\big] \;=\; \max_{\xi \in \mathcal{U}_{\text{CVaR}}(1/\eta,\, \mathbb{P})} \mathbb{E}_\xi\big[C(h_T)\big] \;=\; \text{CVaR}_{1/\eta}\big(C(h_T)\big),$$

where the last equality is by the representation theorem for CVaR [Shapiro et al.(2009)Shapiro, Dentcheva, and Ruszczyński].

A.2 Proof of Lemma 3.2

The proofs of the monotonicity and constant-shift properties follow directly from the definition of the Bellman operator, by noting that $\xi(x')\, P(x' \mid x, a)$ is non-negative and sums to one over $x'$ for any $\xi \in \mathcal{U}_{\text{CVaR}}(y, P(\cdot \mid x, a))$. For the contraction property, denote $\epsilon := \|V_1 - V_2\|_\infty$. Since

$$V_2(x, y) - \epsilon \;\le\; V_1(x, y) \;\le\; V_2(x, y) + \epsilon, \qquad \forall (x, y) \in \mathcal{X} \times \mathcal{Y},$$

by the monotonicity and constant-shift properties,

$$\mathbf{T}[V_2](x, y) - \gamma\,\epsilon \;\le\; \mathbf{T}[V_1](x, y) \;\le\; \mathbf{T}[V_2](x, y) + \gamma\,\epsilon.$$

This further implies that

$$\big\| \mathbf{T}[V_1] - \mathbf{T}[V_2] \big\|_\infty \;\le\; \gamma\,\epsilon \;=\; \gamma\, \|V_1 - V_2\|_\infty,$$

and the contraction property follows.

Now, we prove the concavity-preserving property. Assume that $y\,V(x, y)$ is concave in $y \in \mathcal{Y}$ for any $x \in \mathcal{X}$. Let $y_1, y_2 \in \mathcal{Y}$ and $\lambda \in [0, 1]$, and define $y_\lambda := \lambda y_1 + (1 - \lambda) y_2$. We have

where the first inequality is by concavity, and the second is by the concavity assumption on $y\,V(x, y)$. Now, define $\xi_\lambda := \big(\lambda\, y_1\, \xi_1 + (1 - \lambda)\, y_2\, \xi_2\big) / y_\lambda$. When $\xi_1 \in \mathcal{U}_{\text{CVaR}}(y_1, P(\cdot \mid x, a))$ and $\xi_2 \in \mathcal{U}_{\text{CVaR}}(y_2, P(\cdot \mid x, a))$, we have that $\xi_\lambda(x') \in [0, 1/y_\lambda]$ and $\sum_{x'} \xi_\lambda(x')\, P(x' \mid x, a) = 1$, i.e., $\xi_\lambda \in \mathcal{U}_{\text{CVaR}}(y_\lambda, P(\cdot \mid x, a))$. We thus have

Finally, to show that the inner problem in (6) is a concave maximization, we need to show that

$$\xi \;\mapsto\; \sum_{x' \in \mathcal{X}} \xi(x')\, V\big(x', y\,\xi(x')\big)\, P(x' \mid x, a)$$

is a concave function of $\xi$ for any given $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $a \in \mathcal{A}$. Suppose $z\,V(x', z)$ is a concave function in $z$. Immediately we can see that $\xi(x')\, V\big(x', y\,\xi(x')\big) = \tfrac{1}{y}\big(y\,\xi(x')\big)\, V\big(x', y\,\xi(x')\big)$ is concave in $\xi(x')$ when $y > 0$, being the composition of a concave function with a linear map, scaled by $1/y > 0$. Also notice that, since the transition probability is non-negative, each term $\xi(x')\, V\big(x', y\,\xi(x')\big)\, P(x' \mid x, a)$ is concave in $\xi(x')$. This further implies that

$$\sum_{x' \in \mathcal{X}} \xi(x')\, V\big(x', y\,\xi(x')\big)\, P(x' \mid x, a)$$

is concave in $\xi$. Furthermore, by combining this result with the fact that the feasible set of $\xi$, namely the risk envelope $\mathcal{U}_{\text{CVaR}}(y, P(\cdot \mid x, a))$, is a polytope, we complete the proof of this claim.

A.3 Proof of Theorem 3.3

The first part of the proof is to show that, for any $n \geq 0$,

(11)

by induction, where the initial condition is $x_0 = x$ and the control action is induced by the policy $\mu$. For $n = 0$, the claim holds by definition. By the induction hypothesis, assume the above expression holds at $n$. For $n + 1$,

(12)

where the initial state condition is given by $x_0 = x$. Thus, the equality in (11) is proved by induction.

The second part of the proof is to show that $\lim_{n \to \infty} \mathbf{T}^n[V_0](x, y) = V^*(x, y)$. Recall that $V_{n+1} = \mathbf{T}[V_n]$. Since $\mathbf{T}$ is a contraction and $V_0$ is bounded, one obtains

for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$. The first and the second equalities follow directly from Proposition 2.1 and Proposition 2.2 in [Bertsekas(2012)], and the third equality follows from the definition of $V^*$. Furthermore, since $V_0$ is bounded for any $(x, y)$, the result in (12) implies

Therefore, by taking $n \to \infty$, we have just shown that, for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$, the unique fixed-point solution of $\mathbf{T}[V] = V$ equals $V^*(x, y)$, which completes the proof.