A Martingale Approach and Time-Consistent Sampling-based Algorithms for Risk Management in Stochastic Optimal Control

# A Martingale Approach and Time-Consistent Sampling-based Algorithms for Risk Management in Stochastic Optimal Control

Vu Anh Huynh    Leonid Kogan    Emilio Frazzoli Huynh and Frazzoli are affiliated with or members of the Laboratory for Information and Decision Systems, Kogan is with the Sloan School of Management, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139. huyn0002@gmail.com,lkogan2, frazzoli@mit.edu
###### Abstract

In this paper, we consider a class of stochastic optimal control problems with risk constraints that are expressed as bounded probabilities of failure for particular initial states. We present here a martingale approach that diffuses a risk constraint into a martingale to construct time-consistent control policies. The martingale stands for the level of risk tolerance that is contingent on available information over time. By augmenting the system dynamics with the controlled martingale, the original risk-constrained problem is transformed into a stochastic target problem. We extend the incremental Markov Decision Process (iMDP) algorithm to approximate arbitrarily well an optimal feedback policy of the original problem by sampling in the augmented state space and computing proper boundary conditions for the reformulated problem. We show that the algorithm is both probabilistically sound and asymptotically optimal. The performance of the proposed algorithm is demonstrated on motion planning and control problems subject to bounded probability of collision in uncertain cluttered environments.

## I Introduction

Controlling dynamical systems in uncertain environments is a fundamental and essential problem in several fields, ranging from robotics [1, 2], healthcare [3, 4] to management science, economics and finance [5, 6]. Given a system with dynamics described by a controlled diffusion process, a stochastic optimal control problem is to find an optimal feedback policy to optimize an objective function. Risk management has always been an important part of stochastic optimal control problems to guarantee safety during the execution of control policies. For instance, in critical applications such as self-driving cars and robotic surgery, regulatory authorities can impose a threshold of failure probability during operation of these systems. Thus, finding control policies that fully respect this type of constraint is important in practice.

There has been intensive literature on stochastic optimal control without risk constraints. Even in this setting, it is well-known that closed-form or exact algorithmic solutions for general continuous-time, continuous-space stochastic optimal control problems are computationally challenging [7]. Thus, many approaches have been proposed to investigate approximate solutions of such problems. Deterministic approaches such as discrete Markov Decision Process approximation [8, 9] and solving associated Hamilton-Jacobi-Bellman (HJB) PDEs [10, 11, 12] have been proposed, but the complexities of these approaches scale poorly with the dimension of the state space. In [13, 14, 7], the authors show that randomized algorithms (or sampling-based algorithms) provide a possibility to alleviate the curse of dimensionality by sampling the state space while assuming discrete control inputs. Recently, in [15, 16], a new computationally-efficient sampling-based algorithm called the incremental Markov Decision Process (iMDP) algorithm has been proposed to provide asymptotically-optimal solutions to problems with continuous control spaces.

Built upon the approximating Markov chain method [17, 18], the iMDP algorithm constructs a sequence of finite-state Markov Decision Processes (MDPs) that consistently approximate the original continuous-time stochastic dynamics. Using the rapidly-exploring sampling technique [19] to sample in the state space, iMDP forms the structures of finite-state MDPs randomly over iterations. Control sets for states in these MDPs are constructed or sampled properly in the control space. The finite models serve as incrementally refined models of the original problem. Consequently, distributions of approximating trajectories and control processes returned from these finite models approximate arbitrarily well distributions of optimal trajectories and optimal control processes of the original problem. The iMDP algorithm also maintains low time complexity per iteration by asynchronously computing Bellman updates in each iteration. There are two main advantages when using the iMDP algorithm to solve stochastic optimal control problems. First, the iMDP algorithm provides a method to compute optimal control policies without the need to derive and characterize viscosity solutions of associated HJB equations. Second, the algorithm is suitable for various online robotics applications without a priori discretization of the state space.

Risk management in stochastic optimal control has also been received extensive attention by researchers in several fields. In robotics, a common risk management problem is chance-constrained optimization [20, 21, 22]. Chance constraints specify that starting from a given initial state, the time- probability of success must be above a given threshold where success means reaching goal areas safely. Alternatively, we call these constraints risk constraints if we concern more about failure probabilities. Despite intensive work done to solve this problem in last 20 years, designing computationally-efficient algorithms that respect chance constraints for systems with continuous-time dynamics is still an open question. The Lagrangian approach [23, 24, 25] is a possible method for solving the mentioned constrained optimization. However, this approach requires numerical procedures to compute Lagrange multipliers before obtaining a policy, which is computationally demanding for high dimensional systems and unsuitable for online robotics applications.

In another approach (see, e.g., [26, 27, 28, 29, 30]), most previous works use discrete-time multi-stage formulations to model this problem. In these modified formulations, failure is defined as collision with convex obstacles which can be represented as a set of linear inequalities. Probabilities of safety for states at different time instants as well as for the entire path are pre-specified by users. The proposed algorithms to solve these formulations often involve two main steps. In the first step, these algorithms often use heuristic [26] or iterative [27] risk allocation procedures to identify the tightness of different constraints. In the second step, the formulations with identified active constraints can be solved using mixed integer-linear programming with possible assistance of particle sampling [20] and linear programming relaxation [21]. Computing risk allocation fully is computationally intensive. Thus, in more recent works [28, 29, 30], the authors make use of the Rapidly-Exploring Random Tree (RRT) and RRT algorithms to build tree data structures that also store incremental approximate allocated risks at tree nodes. Based on the RRT algorithm, the authors have proposed the Chance-Constrained-RRT (CC-RRT) algorithm that would provide asymptotically-optimal and probabilistically-feasible trajectories for linear Gaussian systems subject to process noise, localization error, and uncertain environmental constraints. In addition, the authors have also proposed a new objective function that allows users to trade-off between minimizing path duration and risk-averse behavior by adjusting the weights of these additive components in the objective function.

We note that the modified formulations in the above approach do not preserve well the intended guarantees of the original chance constraint formulation. In addition, the approach requires the direct representation of convex obstacles into the formulations. Therefore, solving the resulting mixed integer-linear programming in the presence of a large number of obstacles is computationally demanding. The proposed algorithms are also over-conservative due to loose union bounds when performing the risk allocation procedures. To counter these conservative bounds, CC-RRT constructs more aggressive trajectories by adjusting the weights of the path duration and risk-averse components in the objective function. As a result, it is hard to automate the selection of trajectory patterns.

Moreover, specifying in advance probabilities of safety for states at different time instants and for the entire path can lead to policies that have irrational behaviors due inconsistent risk preference over time. This phenomenon is known as time-inconsistency of control policies. For example, when we execute a control policy returned by one of the proposed algorithms, due to noise, the system can be in an area surrounded by obstacles at some later time , it would be safer if the controller takes into account this situation and increases the required probability of safety at time to encourage careful maneuvers. Similarly, if the system enters an obstacle-free area, the controller can reduce the required probability of safety at time to encourage more aggressive maneuvers. Therefore, to maintain time-consistency of control policies, the controller should adjust safety probabilities so that they are contingent on available information along the controlled trajectory.

In other related works [31, 32, 33], several authors have proposed new formulations in which the objective functions and constraints are evaluated using (different) single-period risk metrics. However, these formulations again lead to potential inconsistent behaviors as risk preferences change in an irrational manner between periods [34]. Recently, in [22], the authors used Markov dynamic time-consistent risk measures [35, 36, 37] to assess the risk of future cost stream in a consistent manner and established a dynamic programming equation for this modified formulation. The resulting dynamic programming equation has functionals over the state space as control variables. When the state space is continuous, the control space has infinite dimensionality, and therefore, solving the dynamic programming equation in this case is computationally challenging.

In mathematical finance, closely-related problems have been studied in the context of hedging with portfolio constraints where constraints on terminal states are enforced almost surely (a.s.), yielding so-called stochastic target problems [38, 39, 40, 41, 42]. Research in this field focuses on deriving HJB equations for this class of problems. Recent analytical tools such as weak dynamic programming [38] and geometric dynamic programming [43, 44] have been developed to achieve this goal. These tools allow us to derive HJB equations and find viscosity solutions for a larger class of problems while avoiding measurability issues.

In this paper, we consider the above risk-constrained problems. That is, we investigate stochastic optimal control problems with risk constraints that are expressed in terms of bounded failure probabilities for particular initial states. We present here a martingale approach to solve these problems such that obtained control policies are time-consistent with the initial threshold of failure probability. The martingale represents the level of risk tolerance that is contingent on available information over time. Thus, the martingale approach transforms a risk-constrained problem into a stochastic target problem. By sampling in the augmented state space and computing proper boundary conditions of the reformulated problem, we extend the iMDP algorithm to compute anytime solutions after a small number of iterations. When more computing time is allowed, the proposed algorithm refines the solution quality in an efficient manner.

The main contribution of this paper is twofold. First, we present a novel martingale approach that fully respects the considered risk constraints for systems with continuous-time dynamics in a time-consistent manner. The approach enable us to manage risk in several practical robotics applications without directly deriving HJB equations, which are hard to obtain in many situations. Second, we propose a computationally-efficient algorithm that guarantees probabilistically-sound and asymptotically-optimal solutions to the stochastic optimal control problem in the presence of risk constraints. That is, all constraints are satisfied in a suitable sense, and the objective function is minimized as the number of iterations approaches infinity. We demonstrate the effectiveness of the proposed algorithm on motion planning and control problems subject to bounded collision probability in uncertain cluttered environments.

This paper is organized as follows. A formal problem definition is given in Section II. In Section III, we discuss the martingale approach and the key transformation. The extended iMDP algorithm is described in Section IV. The analysis of the proposed algorithm is presented in Section V. We present simulation examples and experimental results in Section VI and conclude the paper in Section VII.

## Ii Problem Definition

In this section, we present a generic stochastic optimal control formulation with definitions and technical assumptions as discussed in [15, 16, 45]. We also explain how to formulate risk constraints.

##### Stochastic Dynamics

Let , , and be positive integers. Let be a compact subset of , which is the closure of its interior and has a smooth boundary . Let a compact subset of be a control set. The state of the system at time is , which is fully observable at all times.

Suppose that a stochastic process is a -dimensional Brownian motion on some probability space. We define as the augmented filtration generated by the Brownian motion . Let a control process be a -valued, measurable random process also defined on the same probability space such that the pair is admissible [15]. Let the set of all such control processes be . Let denote the set of all by real matrices. We consider systems with dynamics described by the controlled diffusion process:

 dx(t)=f(x(t),u(t))dt+F(x(t),u(t))dw(t),∀t≥0 (1)

where and are bounded measurable and continuous functions as long as . The initial state is a random vector in . We assume that the matrix has full rank. The continuity requirement of and can be relaxed with mild assumptions [17, 15] such that we still have a weak solution to Eq. (1) that is unique in the weak sense [46].

##### Cost-to-go Function and Risk Constraints

We define the first exit time under a control process starting from as

 Tzu=inf{t:x(0)=z, x(t)∉So, and Eq.(???)}.

In other words, is the first time that the trajectory of the dynamical system given by Eq. (1) starting from hits the boundary of . The random variable can take value if the trajectory never exits .

The expected cost-to-go function under a control process is a mapping from to defined as

 \resizebox422.7795pt$Ju(z)=Ez0[∫Tzu0αtg(x(t),u(t))dt+αTzuh(x(Tzu))]$, (2)

where denotes the conditional expectation given , and , are bounded measurable and continuous functions, called the cost rate function and the terminal cost function, respectively, and is the discount rate. We further assume that is uniformly Hölder continuous in with exponent for all . We note that the discontinuity of can be treated as in [17, 15].

Let be a set of failure states, and be a threshold for risk tolerance given as a parameter. We consider a risk constraint that is specified for an initial state under a control process as follows:

 Pz0(x(Tzu)∈Γ)≤η,

where denotes the conditional probability at time given . That is, controls that drive the system from time until the first exit time must be consistent with the choice of and initial state at time . Intuitively, the constraint enforces that starting from a given state at time , if we execute a control process for times, when is very large, there are at most executions resulting in failure. Control processes that satisfy this constraint are called time-consistent. To have time-consistent control processes, the risk tolerance along controlled trajectories must vary consistently with the initial choice of risk tolerance based on available information over time.

Let be the extended real number set. The optimal cost-to-go function is defined as follows 111The semicolon in signifies that is a parameter. 222Compared to [45], we consider a larger set of control processes than the set of Markov control processes here. We will restrict again to Markov control processes in the reformulated problem.:

 OPT1: J∗(z;η)=infu(⋅)∈UJu(z) (3) s/t Pz0(x(Tzu)∈Γ)≤η  and \ Eq.~{% }(???). (4)

A control process is called optimal if . For any , a control process is called an -optimal policy if .

We call a sampling-based algorithm probabilistically-sound if the probability that a solution returned by the algorithm is feasible approaches one as the number of samples increases. We also call a sampling-based algorithm asymptotically-optimal if the sequence of solutions returned from the algorithm converges to an optimal solution in probability as the number of samples approaches infinity. Solutions returned from algorithms with such properties are called probabilistically-sound and asymptotically-optimal.

In this paper, we consider the problem of computing the optimal cost-to-go function and an optimal control process if obtainable. Our approach, outlined in Section IV, approximates the optimal cost-to-go function and an optimal policy in an anytime fashion using an incremental sampling-based algorithm that is both probabilistically-sound and asymptotically-optimal.

## Iii Martingale Approach

We now present the martingale approach that transforms the considered risk-constrained problem into an equivalent stochastic target problem. The following lemma to diffuse risk constraints is a key tool for our transformation.

### Iii-a Diffusing Risk Constraints

###### Lemma 1 (see [41, 42])

From , a control process is feasible for if and only if there exists a square-integrable (but possibly unbounded) process and a martingale satisfying:

1. , and ,

2. For all , a.s.,

3. a.s,

where if and only if and otherwise. The martingale stands for the level of risk tolerance at time . We call a martingale control process.

###### Proof.

Assuming that there exists and as above, due to the martingale property of , we have:

 Pz0(x(Tzu)∈Γ) =E[1Γ(x(Tzu))|F0] ≤E[q(Tzu)|F0]=q(0)=η.

Thus, is feasible.

Now, let be a feasible control policy. Set . We note that . We define the martingale

 ¯¯¯q(t)=E[1Γ(x(Tzu))|Ft].

Since , we infer that almost surely. We now set

 ˆq(t)=¯¯¯q(t)+(η−η0),

then is a martingale with and almost surely.

Now, we define , which is a stopping time. Thus,

 q(t)=ˆq(t)1t≤τ+1t>τ,

as a stopped process of the martingale at , is a martingale with values in [0,1] a.s.

If , we have

 1Γ(x(Tzu))≤1=q(Tzu),

and if , we have

 q(Tzu) =E[1Γ(x(Tzu))|FTzu]+(η−η0) =1Γ(x(Tzu))+(η−η0)≥1Γ(x(Tzu)).

Hence, also satisfies that .

The control process exists due to the martingale representation theorem [47], which yields . We however note that is possibly unbounded. We also emphasize that the risk tolerance becomes the initial value of the martingale . ∎

### Iii-B Stochastic Target Problem

Using the above lemma, we augment the original system dynamics with the martingale into the following form:

 (5)

where is the control process of the above dynamics. The initial value of the new state is . We will refer to the augmented state space as and the augmented control space as . We also refer to the nominal dynamics and diffusion matrix of Eq. (5) as and respectively.

It is well-known that in the following reformulated problem, optimal control processes are Markov controls [41, 42, 48]. Thus, let us now focus on the set of Markov controls that depend only on the current state, i.e., is a function only of , for all . A function represents a Markov or feedback control policy, which is known to be admissible with respect to the process noise . Let be the set of all such policies . Let and so that . We rename to for the sake of notation clarity. Using these notations, is thus a Markov control policy for the unconstrained problem, i.e. the problem without the risk constraint, that maps from to . Henceforth, we will use to refer to when it is clear from the context. Let be the set of all such Markov control policies on .

Now, let us rewrite cost-to-go function in Eq. (2) for the threshold at time in a new form:

 Jφ(z,η) +αTzφh(x(Tzφ))∣∣(x,q)(0)=(z,η)]. (6)

We therefore transform the risk-constrained problem into a stochastic target problem as follows333The comma in signifies that is a state component rather than a parameter, and is equal to in the previous formulation.:

 OPT2: J∗(z,η)=infφ∈ΨJφ(z,η) (7) s/t 1Γ(x(Tzφ))≤q(Tzφ)  a.s.  and \ Eq.~{}(???). (8)

The constraint in the above formulation specifies the relationship of random variables at the terminal time as target, and hence the name of this formulation [41, 42]. In this formulation, we solve for feedback control policies for all instead of a particular choice of for at time . We note that in this formulation, boundary conditions are not fully specified a priori. In the following subsection, we discuss how to remove the constraint in Eq. (8) by constructing its boundary and computing the boundary conditions.

### Iii-C Characterization and Boundary Conditions

The domain of the stochastic target problem in is:

 D={(z,η)∈¯¯¯¯S | ∃φ∈Ψ s/t 1Γ(x(Tzφ))≤q(Tzφ) a.s.}.

By the definition of the risk-constrained problem , we can see that if then for any . Thus, for each , we define

 γ(z)=inf {η∈[0,1] | (z,η)∈D}, (9)

as the infimum of risk tolerance at . Therefore, we also have:

 γ(z)=infu∈UPz0(x(Tzu)∈Γ)=infu∈UEz0[1Γ(x(Tzu))]. (10)

Thus, the boundary of is

 ∂D= S×{1}∪{(z,γ(z)) | z∈S} ∪{(z,η) | z∈∂S,η∈[γ(z),1]}. (11)

For states in , the system stops on and takes terminal values according to .

The domain is illustrated in Fig. 1(a). In this example, the state space is a bounded two-dimensional area with boundary containing a goal region and an obstacle region . The augmented state space augments with an extra dimension for the martingale state . The infimum probability of reaching into from states in is depicted as . As we can see, takes value in . The volume between and the hyper-plane is the domain of .

Now, let , we notice that is the optimal cost-to-go from for the stochastic optimal problem without the risk constraint:

 J∗(z,1)=infu∈UJu(z).

An optimal control process that solves this optimization problem is given by a Markov policy . We now define the failure probability function under such an optimal policy as follows:

 (12)

where is the first exit time when the system follows the control policy from the initial state . By the definitions of and , we can recognize that for all . Figure 1(b) shows an illustration of for the same example in Fig. 1(a).

Since following the policy from an initial state yields a failure probability , we infer that:

 J∗(z,1)=J∗(z,Υ(z)). (13)

From the definition of the problem , we also have:

 0≤η<η′≤1⇒J∗(z,η)≥J∗(z,η′). (14)

Thus, for any , we have:

 J∗(z,1)≤J∗(z,η)≤J∗(z,Υ(z)). (15)

Combining Eq. (13) and Eq. (15), we have:

 ∀ η∈[Υ(z),1]⇒J∗(z,η)=J∗(z,1). (16)

As a consequence, when we start from an initial state with a risk threshold that is at least , it is optimal to execute an optimal control policy of the corresponding unconstrained problem from the initial state .

It also follows from Eq. (14) that reducing the risk tolerance from along the controlled process can not reduce the optimal cost-to-go function evaluated at . Thus, we infer that for augmented states where , the optimal martingale control is 0.

Now, under all admissible policies , we can not obtain a failure probability for an initial state that are lower than . Thus, it is clear that for all . The following lemma characterizes the optimal martingale control for augmented states .

###### Lemma 2

Given the problem definition as in Eqs. (3)-(4), we assume that is a smooth function444When is not smooth, we need the concept of viscosity solutions and weak dynamic programming principle. See [41, 42] for details.. When and is chosen, we must have:

 c(t)T=∂γ∂x(t)TF(x(t),u(t)). (17)
###### Proof.

Using the geometric dynamic programming principle [43, 44], we have the following result: for all stopping time , when , a feasible control policy satisfies almost surely.

Take , under a feasible control policy , we have a.s. for all , and hence a.s. By It lemma, we derive the following relationship:

 cT(t)dw(t) ≥∂γ∂xT(f(x(t),u(t))dt+F(x(t),u(t))dw(t)) +12 Tr(F(x(t),u(t))F(x(t),u(t))T∂2γ(∂x)2)dt a.s.

For the above inequality to hold almost surely, the coefficient of must be . This leads to Eq. (17).

In addition, if a control process that solves Eq. (10) is obtainable, say , the cost-to-go due to that control process is . We will conveniently refer to as . Under the mild assumption that is unique, it follows that .

We also emphasize that when is inside the interior of , the usual dynamic programming principle holds. The extension of iMDP outlined below is designed to compute the sequence of approximate cost-to-go values on the boundary and in the interior .

## Iv Algorithm

In this section, we briefly overview how the Markov chain approximation technique is used in both the original and augmented state spaces. We then present the extended iMDP algorithm that incrementally constructs the boundary values and computes solutions to our problem. In particular, we sample in the original state space to compute and its induced collision probability as in Eq. (12), the min-failure probability as in Eq. (10) and its induced cost-to-go . Concurrently, we also sample in the augmented state space with appropriate values for samples on the boundary of and approximate the optimal cost-to-go function in the interior . As a result, we construct a sequence of anytime control policies to approximate an optimal control policy in an efficient iterative procedure.

### Iv-a Markov Chain Approximation

A discrete-state Markov decision process (MDP) is a tuple where is a finite set of states, is a set of actions that is possibly a continuous space, is the transition probability function, is an immediate cost function, and is a terminal cost function. From an initial state , under a sequence of controls , the induced trajectory is generated by following the transition probability function .

On the state space , we want to approximate , , and for any state , and it is suffice to consider optimal Markov controls as shown in [15, 16]. The Markov chain approximation method approximates the continuous dynamics in Eq. (1) using a sequence of MDPs and a sequence of holding times that are locally consistent. In particular, we construct , for each and . We also require that where is the sample space of , , and

• For all , ,

• For all and all :

 limn→∞EPn[Δξni|ξni=z,uni=v]Δtn(z) = f(z,v), limn→∞CovPn[Δξni|ξni=z,uni=v]Δtn(z) = F(z,v)F(z,v)T.

The main idea of the Markov chain approximation approach for solving the original continuous problem is to solve a sequence of control problems defined on as follows. A Markov or feedback policy is a function that maps each state to a control . The set of all such policies is . We define for and . Given a policy that approximates a Markov control process in Eq. (2), the corresponding cost-to-go due to on is:

 Jn,μn(z)=EzPn[In−1∑i=0αtniGn(ξni,μn(ξni))+αtnInHn(ξnIn)],

where denotes the conditional expectation given under , and is the sequence of states of the controlled Markov chain under the policy , and is termination time defined as where .

The optimal cost-to-go function that approximates is denoted as

 J∗n(z,1)=infμn∈ΠnJn,μn(z) ∀z∈Sn. (18)

An optimal policy, denoted by , satisfies for all . For any , is an -optimal policy if .

We also define the failure probability function due to an optimal policy as follows:

 Υn(z)=EPn[1Γ(ξnIn) ∣∣ x(0)=z ; μ∗n ] ∀z∈Sn, (19)

where we denote after the semicolon (as a parameter) to emphasize the dependence of the Markov chain on this control policy.

In addition, the min-failure probability on that approximates is defined as:

 (20)

We note that the optimization programs in Eq. (18) and Eq. (20) may have two different optimal feedback control policies. Let be a control policy on that achieves , then the cost-to-go function due to is which approximates . For this reason, we conveniently refer to as .

Similarly, in the augmented state space , we use a sequence of MDPs and a sequence of holding times that are locally consistent with the augmented dynamics in Eq. (5). In particular, is a random subset of , is identical to , and is equal to if and otherwise. Similar to the construction of and , we also construct the transition probabilities on and holding time that satisfy the local consistency conditions for nominal dynamics and diffusion matrix .

A trajectory on is denoted as where . A Markov policy is a function that maps each state to a control . Moreover, admissible at is and at is a function of as shown in Eq. (17). Admissible for other states in is such that the martingale-component process of belongs to [0,1] almost surely. We can show that equivalently, each control component of belongs to . The set of all such policies is .

Under a control policy , the cost-to-go on that approximates Eq. (6) is defined as:

where for with , and is index when the -component of first arrives at . The approximating optimal cost for in Eq. (7) is:

 J∗n(z,η)=infφn∈ΨnJn,φn(z,η) ∀(z,η)∈¯¯¯¯Sn. (21)

To solve the above optimization, we compute approximate boundary values for states on the boundary of using the sequence of MDP on as discussed above. For states , the normal dynamic programming principle holds.

The extension of iMDP outlined below is designed to compute the sequence of optimal cost-to-go functions , associated failure probability functions , min-failure probability functions , min-failure cost functions , and the sequence of anytime control policies and in an incremental procedure.

### Iv-B Extension of iMDP

Before presenting the details of the algorithm, we discuss a number of primitive procedures. More details about these procedures can be found in [15, 16].

#### Iv-B1 Sampling

The procedure sample states independently and uniformly in .

#### Iv-B2 Nearest Neighbors

Given and a set , for any , the procedure returns the nearest states that are closest to in terms of the -dimensional Euclidean norm.

#### Iv-B3 Time Intervals

Given a state and a number , the procedure returns a holding time computed as follows: where is a constant, and are constants in and respectively. The parameter defines the Hölder continuity of the cost rate function as in Section II.

#### Iv-B4 Transition Probabilities

We are given a state , a subset , a control in some control set , a positive number describing a holding time, is a nominal dynamics, is a diffusion matrix. The procedure returns (i) a finite set of states such that the state belongs to the convex hull of and for all , and (ii) a function that maps to a non-negative real numbers such that is a probability distribution over the support . It is crucial to ensure that these transition probabilities result in a sequence of locally consistent chains that approximate and as presented in [17, 15, 16].

#### Iv-B5 Backward Extension

Given and two states , the procedure returns a triple such that (i) and for all , (ii) , (iii) for all , (iv) , and (v) is close to . If no such trajectory exists, the procedure returns failure. We can solve for the triple by sampling several controls and choose the control resulting in that is closest to .

When are in , the procedure returns in which is output of and is sampled according to a Gaussian distribution where is a parameter.

#### Iv-B6 Sampling and Discovering Controls

For and , the procedure returns a set of controls in . We can uniformly sample controls in . Alternatively, for each state , we solve for a control such that (i) and for all , (ii) for all , (iii) and .

For and , the procedure returns a set of controls in such that the -component of these controls are computed as in , and the martingale-control-components of these controls are sampled in admissible sets.