1 Introduction
\DoubleSpacedXI\TheoremsNumberedThrough\ECRepeatTheorems\EquationsNumberedThrough\RUNAUTHOR

Zhao, Haskell, and Cardin

\RUNTITLE

Approximate Dynamic Programming Algorithm for Multi-Stage Capacity Investment Problems

\TITLE

An Approximate Dynamic Programming Algorithm for Multi-Stage Capacity Investment Problems

\ARTICLEAUTHORS\AUTHOR

Sixiang Zhao \AFFDepartment of Industrial Systems Engineering and Management, National University of Singapore, Singapore 117576, \EMAILzhaosixiang@u.nus.edu \AUTHORWilliam B. Haskell \AFFDepartment of Industrial Systems Engineering and Management, National University of Singapore, Singapore 117576, \EMAILisehwb@nus.edu.sg \AUTHORMichel-Alexandre Cardin \AFFDyson School of Design Engineering, Imperial College London, London SW7 2AZ, United Kingdom, \EMAILm.cardin@imperial.ac.uk

\ABSTRACT

This paper studies a dynamic multi-facility capacity investment problem (MCIP) with discrete capacity. In this setting, capacity adjustment decisions are made sequentially based on observations of demand. First, we formulate this problem as a Markov decision process (MDP). Then, we design a customized fitted value iteration (FVI) algorithm. In particular, we approximate the value functions with a two-layer neural network with piecewise linear activation functions. However, the action selection procedure of FVI for MCIP can be time-consuming since the action space is discrete and high dimensional. To speed up the action selection, we recast the action selection problem as a two-stage stochastic programming problem. The resulting recourse function comes from the two-layer neural network, and it is solved with a specialized multi-cut decomposition algorithm. Numerical studies show that our algorithm provides high quality solutions to the MCIP, and also that the multi-cut algorithm can significantly speed up the action selection problem of FVI in comparison to the brute-force method and the integer L-shape algorithm. Finally, we show that the techniques developed here for MCIP are applicable to many other finite-time horizon MDPs with finite but high dimensional action spaces.

\KEYWORDS

real options analysis, multi-facility capacity investment problem, discrete capacity, approximate dynamic programming, decomposition algorithm

## 1 Introduction

Strategic capacity decisions are important to companies because of the high expenditures entailed and the uncertainty associated with the business environment. To deal with the uncertainty, a wiser decision is to adjust the capacity periodically given the new information of the uncertain parameters, such as demands, instead of establishing facilities with a large amount of capacity in the beginning. Real options analysis has provided an efficient framework to evaluate the value of a system with dynamically adjusted capacity (Trigeorgis 1996). It views capacity investment decisions as a series of options, i.e. capacity adjustment options, that can be dynamically exercised. Namely, in each decision epoch, decision makers have the right, but not the obligation, to invest or salvage the capacity once new demand information is revealed. A key question in real options analysis is to evaluate the economic performance of the system with capacity adjustment options over a system without such options (i.e., the value of flexibility). Evaluating the performance of a system with such options is a multi-stage decision making problem under uncertainty, of which the optimal capacity investment policy is a function with respect to the observed uncertain parameters. In a capacity investment problem with a single facility, the capacity investment policy is to determine when and how much to adjust the capacity when the uncertain customer demands are observed over time. In a multi-facility problem, the system has not only the options to adjust capacity, but also the options to switch between facilities; for example, if one facility runs out of capacity, the excess demands can be transported to adjacent facilities by paying extra fees. In this context, evaluating the economic performance becomes harder as the decisions are multi-dimensional. Though multi-dimensional, the form of the optimal policy can still be found when the capacity investment problem is convex—i.e. the capacity is continuous and the capacity adjustment costs are convex (Eberly and Van Mieghem 1997). However, these assumptions are too strict in practice as the capacity of a production system is usually modular-designed; that is, the capacity is discrete.

Motivated by the issues outlined above, we study a multi-facility capacity investment problem (MCIP) with discrete capacity in this paper. Such problems have been widely studied in many fields, such as transportation systems (Sun and Schonfeld 2015), the semiconductor industry (Huang and Ahmed 2009), and energy systems (Singh et al. 2009). To solve this problem, one prevalent method is multi-stage stochastic programming, where the evolution of the uncertain parameters is modeled as a scenario tree and the problem solved as a large-scale mixed integer programming problem. However, the scenario tree grows exponentially with the number of stages and the dimension of the uncertain parameters, so this method can easily become intractable. Another method that can solve MCIP is the decision rule-based multi-stage stochastic programming, where the policy space of the problem is approximated given a preset functional space. Zhao et al. (2018) have designed an if-then decision rule to solve a multi-facility capacity expansion problem with uncertain demands, but the proposed decision rule may require redesigns once capacity contraction or other types of uncertainty, such as price, are considered.

As an alternative, the MCIP can be formulated as a Markov decision process (MDP), and solved via dynamic programming (DP)-based algorithms. However, DP-based methods are subject to the curse of dimensionality; that is to say, the complexity of the algorithm increases exponentially with the number of customers and of facilities. To alleviate the burden brought by the high dimensional state space, we can use approximate dynamic programming (ADP)—more specifically, we can use the fitted value iteration (FVI) algorithm—to approximate the value functions of the problem. In FVI, the value function of the last stage is firstly approximated and then the algorithm proceeds via going backward in time. In each time period, given each generated sample from the state space, one needs to choose the optimal action by solving an action selection problem. Since the action space is finite, the selection problem can be done by enumeration, but this method can be extensively time-consuming in MCIP, whose action space is multi-dimensional.

In this paper, we propose a neural network-based fitted value iteration (NN-FVI) to solve MCIPs, where the capacity can be either expanded or contracted. Then, a decomposition algorithm is proposed to accelerate the action selection procedure of NN-FVI. Theoretical guarantee of the algorithm is provided, and the numerical studies herein show that it can speed up NN-FVI significantly in comparison with the enumeration method. More specifically, the contributions of this paper are summarized below:

1. We analyze the functional form of the value functions of MCIP, and show that the value functions present some piecewise linear structure. Then, we approximate the value functions by two-layer neural networks with rectified linear units (ReLU) being activation functions. We also prove that the NN-FVI algorithm with ReLU is consistent; that is, an -optimal solution with probability can be achieved under some mild conditions.

2. Since the action space of MCIP is multi-dimensional, it can be time-consuming in selecting the optimal action via enumeration. To accelerate the action selection procedure, we formulate it as a two stage stochastic programming problem with a non-convex recourse function where the recourse function is derived from the two-layered neural network. This type of stochastic programming problems has been widely studied in (Laporte and Louveaux 1993, Sen and Sherali 2006, Sen et al. 2010), but our method differs from the literature in that we design a multi-cuts method by making use of the gradient information of the recourse function. We also show that the proposed multi-cut algorithm can converge to the global optimum in a finite number of steps, and the numerical studies herein show that it can speed up the algorithm significantly in comparison with the enumeration method or the integer L-shape algorithm.

3. We explain how to extend our method to other dynamic programming problems, such as MCIP with lead time or MCIP with uncertain rewards/costs; the proposed algorithm works as long as problems have combinatorial action selection and the value functions are approximated via two-layer neural networks with ReLU activation functions.

The remainder of the paper is organized as below. Related work about multi-stage capacity planning problems and their solutions are reviewed in Section 2. In Section 3, an MCIP model with discrete capacity is presented. The proposed NN-FVI, along with the analysis of the value functions, are presented in Section 4. Section 5 introduces the proposed multi-cut decomposition algorithm to solve the action selection procedure. The numerical studies are presented in Section 6 and we discuss the extensions of the proposed method in Section 7. In the last section, we conclude the major findings and arguments of this study.

## 2 Related Work

### 2.1 Capacity Investment Problems

Since the seminal work (Manne 1961), capacity investment problems with stochastic demands have been widely studied. Analytical results for capacity expansion problem with a single facility and discrete capacity have been derived by (Angelus et al. 2000), where the authors found that the optimal policy of the problem is similar to the well-known policy in inventory management. More recently, papers investigated problems where contraction of capacity is allowed (Angelus and Porteus 2002). Ye and Duenyas (2007) studied a capacity investment problem with two-sided fixed costs, and characterized the structure of the optimal policy. This result was further extended by considering risk aversion of decision makers (Lu and Yan 2016). For multi-facility problems, Eberly and Van Mieghem (1997) studied a multi-factors capacity investment problem and derived the structure of the optimal policy, but the result is subject to the assumptions that the capacity is continuous and the capacity adjustment costs are convex. Two-stage multi-facility capacity expansion models with discrete capacity have been studied in the semiconductor industry (Geng et al. 2009) or in transportation systems (Dong et al. 2015), where the first stage sets up capacity plan and the second stage includes the wait-and-see allocation decisions among the installed facilities. However, the two-stage models are inflexible as the capacity investment plan is unchanged during the planning horizon, regardless of the realizations of uncertainty. Huang and Ahmed (2009) studied a multi-facility capacity expansion problem with discrete capacity, and derived an analytical bound for the value of the multi-stage model over that of the two-stage model. Truong and Roundy (2011) developed a multidimensional approximation algorithm to solve capacity expansion problems by decomposing the problem via a cost-separation scheme. Similar applications about multi-stage multi-facility capacity expansion models with discrete capacity can also be found in waste-to-energy systems (Zhao et al. 2018), or in mobility on-demand transportation (Cardin et al. 2017a). For comprehensive review, interesting readers can also refer (Luss 1982, Van Mieghem 2003, Martínez-Costa et al. 2014) for details.

### 2.2 Stochastic Programming Methods

One prevalent method to solve multi-period optimization problems under uncertainty is multi-stage stochastic programming. Ahmed et al. (2003) solved a multi-facility capacity expansion problem via a scenario tree-based multi-stage stochastic programming. Huang and Ahmed (2009) and Taghavi and Huang (2016) decomposed the scenario tree-based model across facilities, and then solved the sub-problems via a linear programming (LP) based approximation algorithm. Taghavi and Huang (2018) considered a similar problem with budget constraints, and solved it with a heuristic Lagrangian relaxation method. Another algorithm that can solve the MCIP with discrete capacity is stochastic dual dynamic integer programming (Zou et al. 2018); it formulates the uncertain parameters as a scenario tree and the problem is solved based on a nested decomposition scheme. However, when solving MCIP, the number of nodes of a scenario tree not only increases exponentially with the number of stages but also with the dimensions of the uncertain parameters, which make the size of the tree astronomical and the convergence of the algorithm slow.

Decision rule-based methods are proposed to solve multi-stage capacity planning problems. A decision rule is a function which maps the realized uncertainty to the decisions; it approximates the policy space of a multi-stage problem by specifying the functional form of the policy (Shapiro et al. 2009). In the previous studies, if-then decision rules have been proposed to solve a multi-stage capacity expansion problem. An if-then decision can be stated as: if the gap of capacity exceeds a certain threshold, then expand the capacity to a certain level (Cardin et al. 2017b). Zhao et al. (2018) extended the if-then decision rule to a multi-facility problem by introducing a weighted matrix, and designed a decomposition algorithm to optimize the control parameters of the rule. Unfortunately, these methods only work for systems with capacity expansion options only. If one needs to extend the if-then decision rules in solving the problems with capacity contraction, the proposed rules require redesigns: another if-then statement for capacity contraction is needed, and it may complicate the problem because additional binary variables are introduced. In addition, the proposed if-then rules do not work if other uncertain parameters, besides demands, are introduced.

### 2.3 Dynamic Programming Methods

As another alternative, approximate dynamic programming (ADP), or reinforcement learning, incorporates the capacity decisions and uncertain parameters into the underlying MDP. The basic idea of ADP is to approximate the value functions or functions via state aggregation (Bertsekas 2012), basic functions approximation (Munos and Szepesvári 2008, Antos et al. 2008, Powell 2011), or sample average approximation (Haskell et al. 2016). Many of the ADPs in the literature assumed that the action space of the MDP is finite, so that we can enumerate all possible actions in the action selection procedure. However, it can be extensively time-consuming to find the optimal action via enumeration when the action space is multi-dimensional. Policy gradient algorithms have been proposed to find the optimal action for problems with continuous and high-dimensional action spaces (Lillicrap et al. 2015, Schulman et al. 2015), but this type of methods may not work for derivative-free problems such as MCIPs with discrete capacity. Other methods to handle with the derivative-free problems include direct policy search—the policy of the problem is approximated, for example, via neural networks—and the optimization of such policy is done by a black-box optimization (Hu et al. 2017). Jain and Varaiya (2010) approximated the policies via sampling and proposed a simulation-based optimization framework for policy improvements. However, the required number of samples of this type of methods could be large. The NN-FVI algorithm studied in this paper has been partly presented in a conference paper (Zhao et al. 2017), but Zhao et al. (2017) studied a problem with capacity expansion only and the proposed algorithm selects the optimal action via enumeration.

## 3 Problem Description

Notation. Throughout this paper, we denote as an -dimensional vector of zero. Denote as the set of integers and as the set of non-negative integers. We further denote as the support of a random variable. Let and denote the dimensions and extreme points of a set, respectively. We denote as the -norm for vectors; if , we use instead of . Finally, for a real-valued measurable function and a probability distribution defined over , we define the -norm by .

Consider a multi-stage stochastic capacity investment problem with customers and facilities . In each time period, customer demand is allocated to and satisfied by the facilities. The objective is to maximize the net present value (NPV) by determining the optimal capacity investment plan and demand allocation plan over the finite planning horizon . Denote as the random demand from customer in time period , and denote its vector form as . Denote as a realization of , with vector form . Let be the installed capacity vector at the end of time period , and let be the initial capacity vector. Define to be the domain of the demand, i.e. Define as the domain of the capacity and as a vector of capacity limits (we assume the capacity is bounded), so that

 K≜{K∈ZN+∣∣0≤K≤K% max}.

Our main assumptions are listed below. {assumption} The process, , is a Markov process with components. Without loss of generality, we assume is independent of the installed capacity for all .

Given Assumption 3, if denote to be the conditional probability function, then we have for all . If demand is continuous, then is the conditional density function.

{assumption}

The demand is non-negative and bounded, i.e. there exists such that for all . If the demand is continuous, we further assume the conditional densities for all are Lipschitz continuous. More specifically, there exists such that

 ∣∣f(dt|dt−1)−f(dt∣∣d′t−1)∣∣≤Ld∥∥dt−1−d′t−1∥∥,  ∀dt−1,d′t−1,dt∈D,t∈T.
{assumption}

The system has the option to expand/contract the installed capacity from to at the end of each period , and capacity adjustments are instantaneous. {assumption} The expansion cost and salvage value are linear with respect to the capacity, and the per unit expansion cost is not smaller than the per unit salvage value.

Assumption 3 is standard since real-world demands are always finite and their variation from one period to the next does not surge to infinity. In addition, we assume that the lead time of capacity adjustment (compared to the length of each time period) is negligible in our strategic capacity investment problem. However, the lead time for capacity adjustment in some industries can be long. For example, in the semiconductor industry, the lead time for purchasing machines can vary from six to eighteen months (Truong and Roundy 2011).

In this paper, we consider an MCIP with linear expansion cost and salvage value, but the proposed method can also solve more general cases whose expansion cost/salvage value are nonlinear. In addition, the resale value of an asset is usually smaller than its purchase price, so the second statement of Assumption 3 is realistic. The notation for our model is summarized in Table 1.

In each period, we can allocate the realized demand to the facilities, given the constraints of the currently installed capacity . A penalty with unit cost is incurred if demand from customer is unsatisfied. Denote as the amount of demand allocated from customer to facility in time period and as its corresponding revenue. Denote the operating profit in time , it is given by:

 Πt(Kt−1,dt) =max∑i∈I∑n∈Nrintzint−∑i∈Ibit(dit−∑n∈Nzint) (1)
 s. t. ∑i∈Izint≤Kn(t−1), ∀n∈N, (2) ∑n∈Nzint≤dit, ∀i∈I, (3) zint≥0, ∀i∈I,n∈N. (4)

In the above, for all are the decision variables that allocate realized demands to the installed facilities and the objective function (1) is to maximize the current rewards, which consist of the revenues and the penalty for unsatisfied demands. Constraints (2) and (3) are capacity and demand constraints respectively. Note that the allocation decisions depend on the current state only and do not affect future activity.

Denote as the capacity adjustment costs in time . Denote and as the unit expansion cost and unit salvage value for facility respectively. According to Assumption 3, we have for all . The cost function is then

 ct(Kt−1,Kt)≜∑n∈Nmax{−q−nt(Kn(t−1)−Knt),q+nt(Knt−Kn(t−1))},

which is convex in .

Based on the aforementioned assumptions, the MCIP can be modeled as an MDP. The state in time period can be represented by a two tuple —i.e. the installed capacity in time and the realized demands—and the action is the adjusted capacity . The state space of our problem, therefore, is and the action space is for all . Denote as a Markov decision rule for time . We denote the class of Markov policies as . Without loss of generality, we assume the initial state for the MCIP is , and the system salvages all installed capacity at the end of period so . MCIP can then be formulated as the following dynamic optimization problem:

 max(K0,…,KT)∈K−c0(0N,K0)+E[∑t∈Tγt(Πt(Kt−1,Dt)−ct(Kt−1,Kt))] (5)

where is the discount factor. The objective of the above problem is to find an optimal policy, i.e. , such that the expected total rewards are maximized. We can solve Problem (5) via its dynamic programming equations. {theorem} Let and define the following dynamic equations for all :

 Vt(Kt−1,dt)= maxKt∈K{Πt(Kt−1,dt)−ct(Kt−1,Kt)+γE[Vt+1(Kt,Dt+1)|dt]}, t∈T, (6) V0(0N,d0)= maxK0∈K{−c0(0N,K1)+γE[V1(K0,D2)|d0]}, t=0. (7)

Given Assumption 3, Eqs. (6)–(7) recover the optimal policy of Problem (5).

Above, we model MCIP as a finite horizon MDP with a finite action space. This problem can be solved by DP-based algorithms, such as value iteration (VI), but VI is not applicable to MCIP when the dimensions of the state/action space are high. More specifically, if demands are discrete, the complexity of VI is , where is of dimensions and is of dimensions. For example, consider a system with four customers and three facilities and . If and demands of each customer are integer values ranging from to , then the complexity of VI is . Therefore, VI is intractable even for a medium size problem. If the demands are continuous, the state space is infinite and exact VI cannot be done.

### 3.1 The FVI Algorithm

Since the state space is large or continuous, evaluating the exact value functions in Eqs. (6)–(7) is intractable because of the curse of dimensionality. To respond to this challenge, we will use FVI to fit the value functions by a finite number of samples generated from the state space. The required number samples is generally much smaller than the original state space. Then, we can approximately solve the Bellman equations in Eqs. (6)–(7) by using the approximated value functions instead of the exact ones.

Define and as the sets of indices for the samples generated from the state space and from the state transitions, respectively. First, a set of state samples is drawn from in the last period , and their values are calculated according to Eq. (6). Then, a set of samples, i.e. , can be generated. Denote as the parametric approximate value functions given adjustable parameters , where and is finite (a simple example is to approximate the value functions via linear basis functions, i.e. ). Then, can be trained by solving the following regression problem:

 ^wT=argminw∈W1S1∑s∈S1(~VT(KsT−1,dsT;w)−^VT(KsT−1,dst))2+β2w⊤w,

where is the regularization parameter (regularizing the objective function can sometimes improve the generalization of the function fitting and avoid overfitting). Subsequently, the algorithm proceeds backwards in time . In time , we draw number of samples of current states. Since the transitions of demands are independent of the actions, for each state sample , we generate the future transitions given via Monte Carlo simulations; that is

 ds′t+1(s)∼P(⋅∣∣dst),  ∀s∈S1,s′∈S2,t∈T∖{T}.

Then, we can calculate the estimated values given the trained function , and solve . The detailed procedure of the algorithm is summarized below.

There are two questions that need to be answered about the above FVI algorithm. First, what is the appropriate approximator such that can fit the true function with arbitrary precision? Second, the action selection problem, i.e. Eq. (8), involves a multi-dimensional action space such that it is difficult to solve via the enumeration method; how can we speed up this algorithm? We can answer both questions. We discuss the choice of approximator in Section 4 and then introduce a decomposition algorithm to accelerate Problem (8) in Section 5.

## 4 Neural Network-based Fitted Value Iteration Algorithm

Munos and Szepesvári (2008, Theorem 2 & Corollary 4) have shown that FVI can achieve an -optimal solution with probability when: (i) the functional family used for approximation is sufficiently rich, and (ii) the sample complexity increase polynomially in the scale of the problem instance. In particular, we want to show that the value functions of our MDP are Lipchitz functions. Then, we will show that our approximator is rich enough to handle Lipschitz functions. We first verify that the value functions of MCIP are Lipchitz, and then we discuss the choice of approximators such that FVI is able to derive -optimal solution with high probability.

### 4.1 Fitting the Value Function

The value functions for all are defined over , which is not connected since is finite. In addition, if demand is discrete, then the state space is finite and discrete. To have a better understanding of the structure of the value functions, we wish to extend to its smallest connected superset. Then, we will construct a set of extended value functions.

First we define

 ¯S≜{¯K×D}  and  ¯K≜{K∈RN∣∣0≤K≤Kmax},

where . Note that can be either continuous or discrete in the above definitions. Now, we may define extended value functions for all . If can recover the exact values of on , then we can analyze the characteristics of to learn about . The study of is more amenable since is a connected set.

The dynamic programming equations for the extended value functions at are given by and

 ¯Vt(Kt−1,dt)≜ maxKt∈K{Πt(Kt−1,dt)−ct(Kt−1,Kt)+γE[¯Vt+1(Kt,Dt+1)∣∣dt]}, t∈T, (10) ¯V0(0N,d0)≜ maxK0∈K{−c0(0N,K1)+γE[¯V1(K1,D2)∣∣d0]}, t=0. (11)

Note that the values of and are attained given any .

Next we show that the value functions can be recovered from the extended functions on .

{proposition}

for all .

###### Proof.

Proof. According to the definition of , we have at the points of . Suppose we have for all . Since the demand transitions are independent of and , we have for all . Therefore, for all according to Eq. (6) and Eq. (10). The result then follows by backward induction. \Halmos

We now show that the extended value functions of MCIP, i.e. , are bounded and Lipchitz. First, we show that the reward function is bounded by a constant

 vmax≜∑i∈Imaxt∈T,n∈N(rint+bit)Dmaxi+∑n∈Nmaxt∈Tq+ntKmaxn.
{lemma}

We have for all .

###### Proof.

Proof. As is a bounded set and for all , an upper and lower bound on follows if we have unlimited or no capacity:

 −∑i∈Imaxt∈TbitDmaxi≤Πt(Kt−1,dt)≤∑i∈Imaxt∈T,n∈NrintDmaxi.

Therefore,

 |Πt(Kt−1,dt)|≤∑i∈Imaxt∈T,n∈N(rint+bit)Dmaxi.

Similarly, bounds on follow by assuming the capacity is changed from zero to or the reverse:

 −∑n∈Nmaxt∈Tq−ntKmaxn≤ct(Kt−1,Kt)≤∑n∈Nmaxt∈Tq+ntKmaxn.

Since we have assumed , it follows that

 |ct(Kt−1,Kt)|≤∑n∈Nmaxt∈Tq+ntKmaxn.

Then, for all , we have

 |Πt(Kt−1,dt)−ct(Kt−1,Kt)| ≤|Πt(Kt−1,dt)|+|ct(Kt−1,Kt)| ≤∑i∈Imaxt∈T,n∈N(rint+bit)Dmaxi+∑n∈Nmaxt∈Tq+ntKmaxn =vmax,

which concludes the proof. \Halmos

Next, we show that the extended value functions are Lipschitz where

 Lv≜(2+T∑τ=0γτ+1Ld)vmax.

This property is extremely important because it allows us to show that our neural net approximation architecture is rich enough.

{proposition}

The functions for all in Eq. (10) are Lipschitz, i.e. for all and .

###### Proof.

Proof. We only provide the proof for the continuous demand case with density function ; the discrete case can be proved similarly. To simplify the notation, we denote the state variable as and the action as . Firstly, we show that for all are bounded. According to Lemma 4.1, we have for all . For , we have , for all . Note that is independent with , and we have

 ∣∣¯Vt(x)∣∣≤maxa∈K∣∣∣Πt(x)−ct(x,a)+γ∫x′∈S¯Vt+1(x′)dP(x′|x)∣∣∣≤(T∑τ=0γτ)vmax,  ∀x∈¯S,t∈T.

Then, we prove the Lipschitz condition for . Firstly, for all , compute

 ≤maxa∈K∣∣Πt(x)−ct(x,a)−Πt(x′)+ct(x′,a)∣∣ ≤2vmax∥∥x−x′∥∥.

For , we have

 ∣∣¯VT(x)−¯VT(x′)∣∣=∣∣ΠT(x)−ΠT(x′)+cT(x,a)−cT(x′,a)∣∣≤2vmax∥∥x−x′∥∥.

In general, for , we have

 ≤ ≤ 2vmax∥∥x−x′∥∥+γ(maxa∈K∫∣∣¯Vt+1(y)(P(y|x)−P(y|x′))∣∣dy) (12) ≤ 2vmax∥∥x−x′∥∥+γ(T∑τ=0γτ)vmax∫∣∣f(y|x)−f(y|x′)∣∣dy (13) ≤ (2+T∑τ=0γτ+1Ld)vmax∥∥x−x′∥∥, (14)

using the fact that for are bounded according to Assumption 3. Thus, there exists such that are Lipschitz functions.

If demands are discrete, we can replace the integration in Eq. (12)–(14) by summation, and the result holds trivially since demands automatically are bounded in this case. \Halmos

FVI proceeds via backward induction: the value function in the last period is solved first, then approximated, and then the remaining value functions are calculated by going backwards in time. If the approximation in the last period is poor, the error can be passed on to the approximation in earlier stages. In the next result, we use backward induction to understand the structure of .

{proposition}

The extended value functions, i.e. Eqs. (10)–(11), have the following properties:

1. is a piecewise linear function;

2. for all are piecewise linear functions if is finite.

###### Proof.

Proof. First, we know that piecewise linearity is preserved under finite summation and the max/min of a finite collection. Now, observe that is a piecewise linear function when defined over . This result follows by transforming into its dual. Since is a linear programming problem and the optimal value is finite and attained for all , strong duality holds (see e.g. (Bertsimas and Tsitsiklis 1997, Chapter 4)). Thus, we have

 Πt(Kt−1,dt)=minμn,λi≥0 ∑i∈I(bit−λi)dit−∑n∈NμnKn(t−1) s. t. (μn+λi−rint−bit)≥0,  ∀i∈I,n∈N.

Let denote the feasible set of for the above problem, where and . Since the dual is also a linear programming problem, the optimal solutions can be chosen from the extreme points of their feasible regions (Bertsimas and Tsitsiklis 1997, Chapter 3), and so the above problem is equivalent to

 min(μ,λ)∈ext(Λ)∑i∈I(bit−λi)dit−∑n∈NμnKn(t−1).

Therefore, is piecewise linear and concave in since it is the min of a finite collection of linear functions. Since and is convex in , is piecewise linear for all and concave in .

To prove (ii), as we have shown that is piecewise linear and concave, by backward induction, we have

 ¯Vt(Kt−1,dt)=maxKt∈K{Π(Kt−1,dt)−c(Kt−1,Kt)+γE[¯Vt+1(Kt,Dt+1)|dt]},  ∀t∈T∖{T}.

Since is piecewise linear and is finite, is piecewise linear as it is a finite sum of piecewise linear functions. Therefore, is piecewise linear as the max of a finite set of piecewise linear functions. \Halmos

{remark}

Note that for all are non-concave in given any . As is concave in given any , is concave in given (see the proof in Proposition 4.1), may be non-concave as it is a finite max of concave functions, and the result follows by backward induction.

According to Proposition 4.1(i), we know that is a piecewise linear function if the capacity is defined over a connected space . However, for , the value functions , according to Proposition 4.1(ii), are piecewise linear only when the demand is discrete. Fortunately, if the demands are continuous, we can use Monte-Carlo simulation to generate finitely many samples of future transitions to approximate the expectation in Eqs. (6)–(7). In this setting, the approximate value functions are still piecewise linear. However, given Remark 4.1, we may need to solve non-convex optimization problems in each .

### 4.2 Two-layer Neural Network with ReLU

To solve the MCIP, we approximate the value functions of the problem by using neural networks with piecewise linear activation functions. Neural networks are powerful approximators, and can incorporate structure of the target function (Jain et al. 1996). In our case, we will use two-layer neural networks. As we will see later in this section, a two-layer network is powerful enough to approximate our value functions arbitrarily well.

A two-layer neural network consists of inputs, one hidden layer for intermediate computations, and an output. Let index the neurons in the hidden layer of our network. The general form of a two-layer neural network is then

 Γ(x)=∑j∈JwjΨj(u⊤jx+u0j)+w0j,

where and are the adjustable weights of the input layer and the hidden layer respectively, and is the activation function for neuron . In MCIP, the inputs of the networks are the states and the outputs are the approximate values Based on Proposition 4.1, we choose the activation functions to be ReLU, which are themselves piecewise linear

 Ψj(u⊤jx+u0j)=max{u⊤jx+u0j,0}.

The architecture of the neural network is shown in Figure 1.

The adjustable weights for the input of neuron in include and an vector of adjustable coefficients

 ujt=(u1jt,…,uNjt,u(N+1)jt,…,u(N+I)jt),  ∀j∈J.

Denote as the vector of all adjustable weights for the input layer and

 wt=(w0t,w1t,…,wJt)

as a vector of the adjustable parameters for the output of the hidden layer. The neural network at time can then be represented by a function given the input and the adjustable weights :

 ~Vt(Kt−1,dt;ut,wt)=∑