Empirical Dynamic Programming
William B. Haskell
EE & CS Departments, University of Southern California
Rahul Jain††thanks: Corresponding author. This research was supported via NSF CAREER award CNS-0954116 and ONR Young Investigator Award N00014-12-1-0766.
EE & ISE Departments, University of Southern California
EE Department, University of Southern California
We propose empirical dynamic programming algorithms for Markov decision processes (MDPs). In these algorithms, the exact expectation in the Bellman operator in classical value iteration is replaced by an empirical estimate to get ‘empirical value iteration’ (EVI). Policy evaluation and policy improvement in classical policy iteration are also replaced by simulation to get ‘empirical policy iteration’ (EPI). Thus, these empirical dynamic programming algorithms involve iteration of a random operator, the empirical Bellman operator. We introduce notions of probabilistic fixed points for such random monotone operators. We develop a stochastic dominance framework for convergence analysis of such operators. We then use this to give sample complexity bounds for both EVI and EPI. We then provide various variations and extensions to asynchronous empirical dynamic programming, the minimax empirical dynamic program, and show how this can also be used to solve the dynamic newsvendor problem. Preliminary experimental results suggest a faster rate of convergence than stochastic approximation algorithms.
Key words: dynamic programming; empirical methods; simulation; random operators; probabilistic fixed points.
MSC2000 subject classification: 49L20, 90C39, 37M05, 62C12, 47B80, 37H99.
OR/MS subject classification: TBD.
History: Submitted: July 3, 2019
Markov decision processes (MDPs) are natural models for decision making in a stochastic dynamic setting for a wide variety of applications. The ‘principle of optimality’ introduced by Richard Bellman in the 1950s has proved to be one of the most important ideas in stochastic control and optimization theory. It leads to dynamic programming algorithms for solving sequential stochastic optimization problems. And yet, it is well-known that it suffers from a “curse of dimensionality” [5, 32, 19], and does not scale computationally with state and action space size. In fact, the dynamic programming algorithm is known to be PSPACE-hard .
This realization led to the development of a wide variety of ‘approximate dynamic programming’ methods beginning with the early work of Bellman himself . These ideas evolved independently in different fields, including the work of Werbos , Kushner and Clark  in control theory, Minsky , Barto, et al  and others in computer science, and Whitt in operations research [45, 46]. The key idea was an approximation of value functions using basis function approximation , state aggregation , and subsequently function approximation via neural networks . The difficulty was universality of the methods. Different classes of problems require different approximations.
Thus, alternative model-free methods were introduced by Watkins and Dayan  where a Q-learning algorithm was proposed as an approximation of the value iteration procedure. It was soon noticed that this is essentially a stochastic approximations scheme introduced in the 1950s by Robbins and Munro  and further developed by Kiefer and Wolfowitz  and Kushner and Clarke . This led to many subsequent generalizations including the temporal differences methods  and actor-critic methods [24, 25]. These are summarized in [9, 40, 34]. One shortcoming in this theory is that most of these algorithms require a recurrence property to hold, and in practice, often work only for finite state and action spaces. Furthermore, while many techniques for establishing convergence have been developed , including the o.d.e. method [28, 12], establishing rate of convergence has been quite difficult . Thus, despite considerable progress, these methods are not universal, sample complexity bounds are not known, and so other directions need to be explored.
A natural thing to consider is simulation-based methods. In fact, engineers and computer scientists often do dynamic programming via Monte-Carlo simulations. This technique affords considerable reduction in computation but at the expense of uncertainty about convergence. In this paper, we analyze such ‘empirical dynamic programming’ algorithms. The idea behind the algorithms is quite simple and natural. In the DP algorithms, replace the expectation in the Bellman operator with a sample-average approximation. The idea is widely used in the stochastic programming literature, but mostly for single-stage problems. In our case, replacing the expectation with an empirical expectation operator, makes the classical Bellman operator a random operator. In the DP algorithm, we must find the fixed point of the Bellman operator. In the empirical DP algorithms, we must find a probabilistic fixed point of the random operator.
In this paper, we first introduce two notions of probabilistic fixed points that we call ‘strong’ and ‘weak’. We then show that asymptotically these concepts converge to the deterministic fixed point of the classical Bellman operator. The key technical idea of this paper is a novel stochastic dominance argument that is used to establish probabilistic convergence of a random operator, and in particular, of our empirical algorithms. Stochastic dominance, the notion of an order on the space of random variables, is a well developed tool (see [30, 38] for a comprehensive study).
In this paper, we develop a theory of empirical dynamic programming (EDP) for Markov decision processes (MDPs). Specifically, we make the following contributions in this paper. First, we propose both empirical value iteration and policy iteration algorithms and show that these converge. Each is an empirical variant of the classical algorithms. In empirical value iteration (EVI), the expectation in the Bellman operator is replaced with a sample-average or empirical approximation. In empirical policy iteration (EPI), both policy evaluation and the policy iteration are done via simulation, i.e., replacing the exact expectation with the simulation-derived empirical expectation. We note that the EDP class of algorithms is not a stochastic approximation scheme. Thus, we don’t need a recurrence property as is commonly needed by stochastic approximation-based methods. Thus, the EDP algorithms are relevant for a larger class of problems (in fact, for any problem for which exact dynamic programming can be done.) We provide convergence and sample complexity bounds for both EVI and EPI. But we note that EDP algorithms are essentially “off-line” algorithms just as classical DP is. Moreover, we also inherit some of the problems of classical DP such as scalability issues with large state spaces. These can be overcome in the same way as one does for classical DP, i.e., via state aggregation and function approximation.
Second, since the empirical Bellman operator is a random monotone operator, it doesn’t have a deterministic fixed point. Thus, we introduce new mathematical notions of probabilistic fixed points. These concepts are pertinent when we are approximating a deterministic operator with an improving sequence of random operators. Under fairly mild assumptions, we show that our two probabilistic fixed point concepts converge to the deterministic fixed point of the classical monotone operator.
Third, since scant mathematical methods exist for convergence analysis of random operators, we develop a new technique based on stochastic dominance for convergence analysis of iteration of random operators. This technique allows for finite sample complexity bounds. We use this idea to prove convergence of the empirical Bellman operator by constructing a dominating Markov chain. We note that there is an extant theory of monotone random operators developed in the context of random dynamical systems  but the techniques for convergence analysis of random operators is not relevant to our context. Our stochastic dominance argument can be applied for more general random monotone operators than just the empirical Bellman operator.
We also give a number of extensions of the EDP algorithms. We show that EVI can be performed asynchronously, making a parallel implementation possible. Second, we show that a saddle point equilibrium of a zero-sum stochastic game can be computed approximately by using the minimax Bellman operator. Third, we also show how the EDP algorithm and our convergence techniques can be used even with continuous state and action spaces by solving the dynamic newsvendor problem.
A key question is how is the empirical dynamic programming method different from other methods for simulation-based optimization of MDPs, on which there is substantial literature. We note that most of these are stochastic approximation algorithms, also called reinforcement learning in computer science. Within this class. there are Q learning algorithms, actor-critic algorithms, and appproximate policy iteration algorithms. Q-learning was introduced by Watkins and Dayan but its convergence as a stochastic approximation scheme was done by Bertsekas and Tsitsiklis . Q-learning for the average cost case was developed in , and in a risk-sensitive context was developed in . The convergence rate of Q-learning was established in  and similar sample complexity bounds were given in . Actor-critic algorithms as a two time-scale stochastic approximation was developed in . But the most closely related work is optimistic policy iteration  wherein simulated trajectories are used for policy evaluation while policy improvement is done exactly. The algorithm is a stochastic approximation scheme and its almost sure convergence follows. This is true of all stochastic approximation schemes but they do require some kind of recurrence property to hold. In contrast, EDP is not a stochastic approximation scheme, hence it does not need such assumptions. However, we can only guarantee its convergence in probability.
A class of simulation-based optimization algorithms for MDPs that is not based on stochastic approximations is the adaptive sampling methods developed by Fu, Marcus and co-authors [13, 14]. These are based on the pursuit automata learning algorithms [41, 35, 31] and combine multi-armed bandit learning ideas with Monte-Carlo simulation to adaptively sample state-action pairs to approximate the value function of a finite-horizon MDP.
Some other closely related works are  which introduces simulation-based policy iteration (for average cost MDPs). It basically shows that almost sure convergence of such an algorithm can fail. Another related work is . wherein a simulation-based value iteration is proposed for a finite horizon problem. Convergence in probability is established if the simulation functions corresponding to the MDP is Lipschitz continuous. Another closely related paper is , which considers value iteration with error. We note that our focus is on infinite horizon discounted MDPs. Moreover, we do not require any Lipschitz continuity condition. We show that EDP algorithms will converge (probabilistically) whenever the classical DP algorithms will converge (which is almost always).
A survey on approximate policy iteration is provided in . Approximate dynamic programming (ADP) methods are surveyed in . In fact, many Monte-Carlo-based dynamic programming algorithms are introduced herein (but without convergence proof.) Simulation-based uniform estimation of value functions was studied in [20, 21]. This gave PAC learning type sample complexity bounds for MDPs and this can be combined with policy improvement along the lines of optimistic policy iteration.
This paper is organized as follows. In Section id1, we discuss preliminaries and briefly talk about classical value and policy iteration. Section 2.1 presents empirical value and policy iteration. Section 3.2 introduces the notion of random operators and relevant notions of probabilistic fixed points. In this section, we also develop a stochastic dominance argument for convergence analysis of iteration of random operators when they satisfy certain assumptions. In Section id1, we show that the empirical Bellman operator satisfies the above assumptions, and present sample complexity and convergence rate estimates for the EDP algorithms. Section id1 provides various extensions including asychronous EDP, minmax EDP and EDP for the dynamic newsvendor problem. Basic numerical experiments are reported in Section id1.
We first introduce a typical representation for a discrete time MDP as the 5-tuple
Both the state space and the action space are finite. Let denote the space of probability measures over and we define similarly. For each state , the set is the set of feasible actions. The entire set of feasible state-action pairs is
The transition law governs the system evolution, for all , i.e., for is the probability of visiting the state next given the current state-action pair . Finally, is a cost function that depends on state-action pairs.
Let denote the class of stationary deterministic Markov policies, i.e., mappings which only depend on history through the current state. We only consider such policies since it is well known that there is an optimal policy in this class. For a given state , is the action chosen in state under the policy . We assume that only contains feasible policies that respect the constraints . The state and action at time are denoted and , respectively. Any policy and initial state determine a probability measure and a stochastic process defined on the canonical measurable space of trajectories of state-action pairs. The expectation operator with respect to is denoted .
We will focus on infinite horizon discounted cost MDPs with discount factor . For a given initial state , the expected discounted cost for policy is denoted by
The optimal cost starting from state is denoted by
and denotes the corresponding optimal value function in its entirety.
The Bellman operator is defined as
for any , where is the random next state visited, and
is the explicit computation of the expected cost-to-go conditioned on state-action pair . Value iteration amounts to iteration of the Bellman operator. We have a sequence where for all and an initial seed . This is the well-known value iteration algorithm for dynamic programming.
We next state the Banach fixed point theorem which is used to prove that value iteration converges. Let be a Banach space with norm . We call an operator a contraction mapping when there exists a constant such that
(Banach fixed point theorem) Let be a Banach space with norm , and let be a contraction mapping with constant . Then,
(i) there exists a unique such that ;
(ii) for arbitrary , the sequence converges in norm to as ;
(iii) for all .
For the rest of the paper, let denote the space of contraction mappings from . It is well known that the Bellman operator with constant is a contraction operator, and hence has a unique fixed point . It is known that value iteration converges to as . In fact, is the optimal value function.
Policy iteration is another well known dynamic programming algorithm for solving MDPs. For a fixed policy , define as
The first step is a policy evalution step. Compute by solving for . Let be the vector of one period costs corresponding to a policy , and , the transition kernel corresponding to the policy . Then, writing we have the linear system
The second step is a policy improvement step. Given a value function , find an ‘improved’ policy with respect to such that
Thus, policy iteration produces a sequence of policies and as follows. At iteration , we solve the linear system for , and then we choose a new policy satisfying
which is greedy with respect to . We have a linear convergence rate for policy iteration as well. Let be any value function, solve for , and then compute . Then, we know [9, Lemma 6.2] that
from which convergence of policy iteration follows. Unless otherwise specified, the norm we will use in this paper is the sup norm.
We use the following helpful fact in the paper. Proof is given in Appendix id1.
Let be a given set, and and be two real-valued functions on . Then,
(i) , and
We now present empirical variants of dynamic programming algorithms. Our focus will be on value and policy iteration. As the reader will see, the idea is simple and natural. In subsequent sections we will introduce the new notions and techniques to prove their convergence.
We introduce empirical value iteration (EVI) first.
The Bellman operator requires exact evaluation of the expectation
We will simulate and replace this exact expectation with an empirical estimate in each iteration. Thus, we need a simulation model for the MDP. Let
be a simulation model for the state evolution for the MDP, i.e. yields the next state given the current state, the action taken and an i.i.d. random variable. Without loss of generality, we can assume that is a uniform random variable on and . With this convention, the Bellman operator can be written as
Now, we replace the expectation with its sample average approximation by simulating . Given i.i.d. samples of a uniform random variable, denoted , the empirical estimate of is . We note that the samples are regenerated at each iteration. Thus, the EVI algorithm can be summarized as follows.
Input: , sample size .
Set counter .
Sample uniformly distributed random variables from , and compute
Increment and return to step 2.
In each iteration, we regenerate samples and use this empirical estimate to approximate . Now we give the sample complexity of the EVI algorithm. Proof is given in Section id1.
Given and , fix and select such that where . Select an such that
where and select a such that
where and is given by Lemma 4.3. Then
This result says that, if we take samples in each iteration of the EVI algorithm and perform iterations then the EVI iterate is close to the optimal value function with probability greater that . We note that the sample complexity is .
The basic idea in the analysis is to frame EVI as iteration of a random operator which we call the empirical Bellman operator. We define as
This is a random operator because it depends on the random noise samples . The definition and the analysis of this operator is done rigorously in Section id1.
We now define EPI along the same lines by replacing exact policy improvement and evaluation with empirical estimates. For a fixed policy , we can estimate via simulation. Given a sequence of noise , we have for all . For , choose a finite horizon such that
We use the time horizon to truncate simulation, since we must stop simulation after finite time. Let
be the realization of on the sample path .
The next algorithm requires two input parameters, and , which determine sample sizes. Parameter is the sample size for policy improvement and parameter is the sample size for policy evaluation. We discuss the choices of these parameters in detail later. In the following algorithm, the notation is understood as the state at time in the simulated trajectory .
Input: , .
Set counter .
For each , draw and compute
Draw . Choose to satisfy
Increase and return to step 2.
Stop when .
Step 2 replaces computation of (policy improvement). Step 3 replaces solution of the system (policy evaluation).
We now give the sample complexity result for EPI. Proof is given in Section id1.
Given , select such that . Also select such that . Then, select such that where . Then, select a and such that
where , and select a such that
where and is given by equation (5.6). Then,
This result says that, if we do simulation runs for empirical policy evaluation, samples for empirical policy update and perform iterations then the true value of the policy will be close to the optimal value function with probability greater that . We note that is and is .
The empirical Bellman operator we defined in equation (3.1) is a random operator. When it operates on a vector, it yields a random vector. When is iterated, it produces a stochastic process and we are interested in the possible convergence of this stochastic process. The underlying assumption is that the random operator is an ‘approximation’ of a deterministic operator such that converges to (in a sense we will shortly make precise) as increases. For example the empirical Bellman operator approximates the classical Bellman operator. We make this intuition mathematically rigorous in this section. The discussion in this section is not specific to the Bellman operator, but applies whenever a deterministic operator is being approximated by an improving sequence of random operators .
In this subsection we formalize the definition of a random operator, denoted by .
Since is a random operator, we need an appropriate probability space upon which to define . So, we define the sample space , the algebra where is the inherited Borel algebra on , and the probability distribution on formed by an infinite sequence of uniform random variables. The primitive uncertainties on are infinite sequences of uniform noise where each is an independent uniform random variable on . We view as the appropriate probability space on which to define iteration of the random operators .
Next we define a composition of random operators, , on the probability space , for all and all where,
Note that is an infinite sequence where each . Then we can define the iteration of with an initial seed (we use the hat notation to emphasize that the iterates are random variables generated by the empirical operator) as
Notice that we only iterate for fixed . The sample size is constant in every stochastic process , where , for all . For a fixed , we can view all as measurable mappings from to via the mapping .
The relationship between the fixed points of the deterministic operator and the probabilistic fixed points of the random operator depends on how approximates . Motivated by the relationship between the classical and the empirical Bellman operator, we will make the following assumption.
and . Also has a (possibly non-unique) fixed point such that .
Assumption 4.1 is equivalent to for almost all . Here, we benefit from defining all of the random operators together on the sample space , so that the above convergence statement makes sense.
We now introduce a natural probabilistic fixed point notion for , in analogy to the definition of a fixed point, for a deterministic operator.
A vector is a strong probabilistic fixed point for the sequence if
We note that the above notion is defined for a sequence of random operators, rather than for a single random operator.
We can give a slightly more general notion of a probabilistic fixed point which we call -strong probablistic fixed point. For a fixed , we say that a vector is an -strong probabilistic fixed point if there exists an such that for all we get . Note that, all strong probabilistic fixed points satisfy this condition for arbitrary and hence are -strong probabilistic fixed points. However the converse need not be true. In many cases we may be looking for an -optimal ‘solution’ with a ‘probabilistic guarantee’ where is fixed a priori. In fact, this would provide an approximation to the strong probabilistic fixed point of the sequence of operators.
It is well known that iteration of a deterministic contraction operator converges to its fixed point. It is unclear whether a similar property would hold for random operators, and whether they would converge to the strong probabilistic fixed point of the sequence in any way. Thus, we define an apparently weaker notion of a probabilistic fixed point that explicitly considers iteration.
A vector is a weak probabilistic fixed point for if
We use instead of because the latter limit may not exist for any fixed .
Similar to the definition that we gave in Remark 4.1, we can define an -weak probablistic fixed point. For a fixed , we say that a vector is an -weak probabilistic fixed point if there exists an such that for all we get . As before, all weak probabilistic fixed points are indeed weak probabilistic fixed points, but converse need not be true.
At this point the connection between strong/weak probabilistic fixed points of the random operator and the classical fixed point of the deterministic operator is not clear. Also it is not clear whether the random sequence converges to either of these two fixed point. In the following subsections we address these issues.
In this subsection, we construct a new stochastic process on that will be useful in our analysis. We first start with a simple lemma.
The stochastic process is a Markov chain on .
This follows from the fact that each iteration of is independent, and identically distributed. Thus, the next iterate only depends on history through the current iterate . \@endparenv
Even though is a Markov chain, its analysis is complicated by two factors. First, is a Markov chain on the continuous state space , which introduces technical difficulties in general when compared to a discrete state space. Second, the transition probabilities of are too complicated to compute explicitly.
Since we are approximating by and want to compute , we should track the progress of to the fixed point of . Equivalently, we are interested in the real-valued stochastic process . If approaches zero then approaches , and vice versa.
The state space of the stochastic process is , which is simpler than the state space of , but which is still continuous. Moreover, is a non-Markovian process in general. In fact it would be easier to study a related stochastic process on a discrete, and ideally a finite state space. In this subsection we show how this can be done.
We make a boundedness assumption next.
There exists a such that almost surely for all , . Also, .
Under this assumption we can restrict the stochastic process to the compact state space
We will adopt the convention that any element outside of will be mapped to its projection onto by any realization of .
Choose a granularity to be fixed for the remainder of this discussion. We will break up into intervals of length starting at zero, and we will note which interval is occupied by at each . We will define a new stochastic process on with state space . The idea is that will report which interval of is occupied by . Define via the rule:
for all . More compactly,
where denotes the smallest integer greater than or equal to . Thus the stochastic process is a report on how close the stochastic process is to zero, and in turn how close the Markov chain is to the true fixed point of .
Define the constant
Notice is the smallest number of intervals of length needed to cover the interval . By construction, the stochastic process is restricted to the finite state space
The process need not be a Markov chain. However, it is easier to work with than either or because it has a discrete state space. It is also easy to relate back to .
Recall that denotes almost sure inequality between two random variables and defined on the same probability space. The stochastic processes and are defined on the same probability space, so the next lemma follows by construction of .
For all , .
To proceed, we will make the following assumptions about the deterministic operator and the random operator .
for all .
There is a sequence such that
and as for all , .
We now discuss the convergence rate of . Let . On the event , we have
We conclude this subsection with a comment about the state space of the stochastic process of . If we start with and if then we must have improvement in the proximity of to . We define a new constant
If is too small, then may be equal to and no improvement in the proximity of to can be detected by . For any , and strict improvement must hold. So, for the stochastic process , we can restrict our attention to the state space .
If we could understand the behavior of the stochastic processes , then we could make statements about the convergence of and . Although simpler than and , the stochastic process is still too complicated to work with analytically. We overcome this difficulty with a family of dominating Markov chains. We now present our dominance argument. Several technical details are expanded upon in the appendix.
We will denote our family of “dominating” Markov chains (MC) by . We will construct these Markov chains to be tractable and to help us analyze . Notice that the family has explicit dependence on . We do not necessarily construct on the probability space . Rather, we view as being defined on , the canonical measurable space of trajectories on , so . We will use to denote the probability measure of on . Since will be a Markov chain by construction, the probability measure will be completely determined by an initial distribution on and a transition kernel. We denote the transition kernel of as .
Our specific choice for is motivated by analytical expediency, though the reader will see that many other choices are possible. We now construct the process explicitly, and then compute its steady state probabilities and its mixing time. We will define the stochastic process on the finite state space , based on our observations about the boundedness of and . Now, for a fixed and (we drop the argument in the following for notational convenience) as assumed in Assumption 4.4 we construct the dominating Markov chain as:
The first value corresponds to the case where the approximation error satisfies , and the second value corresponds to all other cases (giving us an extremely conservative bound in the sequel). This construction also ensures that for all , almost surely. Informally, will either move one unit closer to zero until it reaches , or it will move (as far away from zero as possible) to .
We now summarize some key properties of .
For as defined above,
(i) it is a Markov chain;
(ii) the steady state distribution of , and the limit , exists;
(iii) as for all ;
Parts (i) - (iii) follow by construction of and the fact that this family consists of irreducible Markov chains on a finite state space. \@endparenv
We now describe a stochastic dominance relationship between the two stochastic processes and . The notion of stochastic dominance (in the usual sense) will be central to our development.
Let and be two real-valued random variables, then stochastically dominates , written , when for all increasing functions .
The condition is known to be equivalent to
for all in the support of . Notice that the relation makes no mention of the respective probability spaces on which and are defined - these spaces may be the same or different (in our case they are different).
Let be the filtration on