# Probably Approximately Correct Vision-Based Planning using Motion Primitives

## Abstract

This paper presents a deep reinforcement learning approach for synthesizing vision-based planners that provably generalize to novel environments (i.e., environments unseen during training). We leverage the Probably Approximately Correct (PAC)-Bayes framework to obtain an upper bound on the expected cost of policies across all environments. Minimizing the PAC-Bayes upper bound thus trains policies that are accompanied by a certificate of performance on novel environments. The training pipeline we propose provides strong generalization guarantees for deep neural network policies by (a) obtaining a good prior distribution on the space of policies using Evolutionary Strategies (ES) followed by (b) formulating the PAC-Bayes optimization as an efficiently-solvable parametric convex optimization problem. We demonstrate the efficacy of our approach for producing strong generalization guarantees for learned vision-based motion planners through two simulated examples: (1) an Unmanned Aerial Vehicle (UAV) navigating obstacle fields with an onboard vision sensor, and (2) a dynamic quadrupedal robot traversing rough terrains with proprioceptive and exteroceptive sensors.

## I Introduction

Imagine an unmanned aerial vehicle (UAV) navigating through a dense environment using an RGB-D sensor (Figure 1). Vision-based planning of this kind has been the subject of decades of work in the robotics literature. Traditional approaches to this problem use RGB-D sensors to perform state estimation and create a (potentially local) map of obstacles in the environment; the resulting state estimate and map are used in conjunction with motion planning techniques (e.g., rapidly-exploring random trees [16] or motion primitives [31, 2]) to perform navigation [31, 26, 21]. Recent approaches have sought to harness the power of deep learning in order to forego explicit geometric representations of the environment and learn to perform vision-based planning [2, 33, 12]. These approaches learn to map raw sensory inputs to a latent representation of the robot’s state and environment using neural networks; planning is performed in this latent space. Learning-based approaches to planning have two primary advantages over more traditional methods: (1) the use of convolutional neural networks allows one to elegantly handle RGB-D inputs, and (2) one can learn to exploit statistical regularities of natural environments to improve planning performance. However, deep learning-based approaches to planning currently provide no explicit guarantees on generalization performance. In other words, such approaches are unable to provide bounds on the performance of the learned planner when placed in a novel environment (i.e., an environment that was not seen during training).

The goal of this paper is to address this challenge by developing an approach that learns to plan using RGB-D sensors while providing explicit bounds on generalization performance. In particular, the guarantees associated with our approach take the form of probably approximately correct (PAC) [23] bounds on the performance of the learned vision-based planner in novel environments. Concretely, given a set of training environments, we learn a planner with a provable bound on its expected performance in novel environments (e.g., a bound on the probability of collision). This bound holds with high probability (over sampled training environments) under the assumption that training environments and novel environments are drawn from the same (but unknown) underlying distribution (see Section II for a formal description of our problem formulation).

Statement of Contributions. To our knowledge, the results in this paper constitute the first attempt to provide guarantees on generalization performance for vision-based planning using neural networks. To this end, we make three primary contributions. First, we develop a framework that leverages PAC-Bayes generalization theory [23] for learning to plan in a receding-horizon manner using a library of motion primitives. The planners trained using this framework are accompanied by certificates of performance on novel environments in the form of PAC bounds. Second, we present algorithms based on evolutionary strategies [38] and convex optimization (relative entropy programming [5]) for learning to plan with high-dimensional sensory feedback (e.g., RGB-D) by explicitly optimizing the PAC-Bayes bounds on the performance. Finally, we demonstrate the ability of our approach to provide strong generalization bounds for vision-based motion planners through two examples: navigation of an Unmanned Aerial Vehicle (UAV) across obstacle fields (see Fig. 1) and locomotion of a quadruped across rough terrains (see Fig. 1). In both examples, we obtained PAC-Bayes bounds that guarantee successful traversal of of the environments on average.

### I-A Related Work

Planning with Motion Primitive Libraries. Motion primitive libraries comprise pre-computed primitive trajectories that can be sequentially composed to generate a rich class of motions. The use of such libraries is prevalent in the literature on motion planning; a non-exhaustive list of examples includes navigation of UAVs [21], balancing [18] and navigation [26] of humanoids, grasping [1], and navigation of autonomous ground vehicles [31]. Several approaches furnish theoretical guarantees on the composition of primitive motions, such as maneuver automata [10], composition of funnel libraries by estimating regions-of-attraction [3, 34, 21], and leveraging the theory of switched systems [35]. However, unlike the present paper, none of the above provide theoretical guarantees for vision-based planning by composing motion primitives.

Planning with Vision. The advent of Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) has recently boosted interest in learning vision-based motion planners. Recent methods can be classified into one of three categories: (1) self-supervised learning approaches [37, 12] that uncover low-dimensional latent-space representations of the visual data before planning; (2) imitation learning approaches [2, 33, 28] that leverage an expert’s motion plans; and (3) deep reinforcement learning (RL) approaches [39, 9, 11, 17] that perform RL with visual data and uncover the latent-space representations relevant to the planning task. Planning guarantees presented in the above papers are limited to the learned low-dimensional latent space embeddings and do not necessarily translate to the actual execution of the plan. In this paper, we provide generalization guarantees for the execution of our neural network planning policies in novel environments.

Generalization Guarantees on Motion Plans. Generalization theory was developed in the context of supervised learning to provide bounds on a learned model’s performance on novel data [32]. In the domain of robotics, these PAC generalization bounds were used in [13] to learn a stochastic robot model from experimental data. PAC bounds were also adopted by the controls community to learn robust controllers [36, 4]; however, their use has not extended to vision-based DNN controllers/policies. In this paper, we use the PAC-Bayes framework, which has recently been successful in providing generalization bounds for DNNs for supervised learning [8, 29]. Our previous work [20, 19] developed the PAC-Bayes Control framework to provide generalization bounds on learned control policies. This paper differs from our previous work in three main ways. (1) We plan using motion primitive libraries instead of employing reactive control policies. This allows us to embed our knowledge of the robot’s dynamics while simultaneously reducing the complexity of the policies. (2) We perform deep RL with PAC-Bayes guarantees using rich sensory feedback (e.g., depth map) on DNN policies. (3) Finally, this paper contributes various algorithmic developments. We develop a training pipeline using Evolutionary Strategies (ES) [38] to obtain a prior for the PAC-Bayes optimization. Furthermore, we develop an efficient Relative Entropy Program (REP)-based PAC-Bayes optimization with the recent quadratic-PAC-Bayes bound [29, Theorem 1] that was shown to be tighter than the bound used in [20, 19].

## II Problem Formulation

We consider robotic systems with discrete-time dynamics:

(1) |

where is the time-step, is the robot’s state, is the control input, and is the robot’s “environment”. We use the term “environment” broadly to refer to any exogenous effects that influence the evolution of the robot’s state; e.g., the geometry of an obstacle environment that a UAV must navigate or the geometry of terrain that a legged robot must traverse. In this paper we will make the following assumption.

###### Assumption 1

There is an underlying unknown distribution over the space of all environments that the robot may be deployed in. At training time, we are provided with a dataset of environments drawn i.i.d. from .

It is important to emphasize that we do not assume any explicit characterization of or . We only assume indirect access to in the form of a training dataset (e.g., a dataset of building geometries for the problem of UAV navigation).

Let be the robot’s exteroceptive sensor (e.g., vision) that furnishes an observation from a state and an environment . Further, let be the robot’s proprioceptive sensor mapping that maps the robot’s state to a sensor output . We aim to learn control policies that have a notion of planning embedded in them. In particular, we will work with policies that utilize rich sensory observations , e.g., vision or depth, to plan the execution of a motion primitive from a library in a receding-horizon manner. Each member of is a (potentially time-varying) proprioceptive controller and the index set is compact.

We assume the availability of a cost function that defines the robot’s task. For the sake of simplicity, we will assume that the environment captures all sources of stochasticity (including random initial conditions); thus, the cost associated with deploying policy on a particular environment (over a given time horizon ) is deterministic. In order to apply PAC-Bayes theory, we assume that the cost is bounded. Without further loss of generality, we assume . As an example in the context of navigation, the cost function may assign a cost of 1 for colliding with an obstacle in a given environment (during a finite time horizon) and a cost of 0 otherwise.

The goal of this work is to learn policies that minimize the expected cost across novel environments drawn from :

(2) |

As the distribution over environments is unknown, a direct computation of for the purpose of the minimization in (2) is infeasible. The PAC-Bayes framework [22, 23] provides us an avenue to alleviate this problem. However, in order to leverage it, we will work with a slightly more general problem formulation. In particular, we learn a distribution over the space of policies instead of finding a single policy. When the robot is faced with a given environment, it first randomly selects a policy using and then executes this policy. The corresponding optimization problem is:

where is the space of probability distributions over . We emphasize that the distribution over environments is unknown to us. We are only provided a finite training dataset to learn from; solving thus requires finding (distributions over) policies that generalize to novel environments.

## III PAC-Bayes Control

We now describe the PAC-Bayes Control approach developed in [20, 19] and perform suitable extensions for vision-based planning using motion primitives. Let denote a space of policies parameterized by weight vectors that determine the mapping from observations in to primitives in . Specifically, the parameters will correspond to weights of a neural network. Let represent a “prior” distribution over control policies obtained by specifying a distribution over the parameter space . The PAC-Bayes approach requires this prior to be chosen independently of the dataset of training environments. As described in Section II, our goal is to learn a distribution over policies that minimizes the objective in . We will refer to as the “posterior”. We note that the prior and the posterior need not be Bayesian. We define the empirical cost associated with a particular choice of posterior as the average (expected) cost across training environments in :

(3) |

The PAC-Bayes Control result then can be stated as follows.

###### Theorem 1

For any and posterior , with probability over sampled environments , the following inequalities hold:

(4) | ||||

(5) |

where is defined as:

(6) |

The bound (4) was proved in [20, Theorem 2]. The proof of (5) is analogous to that of [20, Theorem 2], the only difference being that we use [29, Theorem 1] in place of [20, Corollary 1].

This result provides an upper bound (that holds with probability ) on our primary quantity of interest: the objective in . In other words, it allows us to bound the true expected cost of a posterior policy distribution across environments drawn from the (unknown) distribution . Theorem 1 suggests an approach for choosing a posterior over policies; specifically, one should choose a posterior that minimizes the bounds on . The bounds are a composite of two quantities: the empirical cost and a “regularization” term (both of which can be computed given the training dataset and a prior ). Intuitively, minimizing these bounds corresponds to minimizing a combination of the empirical cost and a regularizer that prevents one from overfitting to the specific training environments.
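As a rough numerical illustration of how the two quantities trade off, the sketch below evaluates both bounds for a given empirical cost, KL divergence, dataset size, and confidence parameter. It assumes that (4) takes the square-root form "empirical cost plus the square root of the regularizer" and that (5) takes the quadratic form of [29, Theorem 1], with the regularizer equal to the KL divergence plus log(2 sqrt(N)/delta), divided by 2N. These forms, the function names, and the numbers in the demo are assumptions made for illustration.

```python
import math

def regularizer(kl, n_envs, delta):
    """Assumed regularizer: (KL(P || P0) + log(2 sqrt(N) / delta)) / (2 N)."""
    return (kl + math.log(2.0 * math.sqrt(n_envs) / delta)) / (2.0 * n_envs)

def pac_bound(emp_cost, kl, n_envs, delta):
    """Assumed form of bound (4): empirical cost + sqrt(regularizer)."""
    return emp_cost + math.sqrt(regularizer(kl, n_envs, delta))

def quadratic_pac_bound(emp_cost, kl, n_envs, delta):
    """Assumed form of bound (5): (sqrt(cost + R) + sqrt(R)) ** 2."""
    r = regularizer(kl, n_envs, delta)
    return (math.sqrt(emp_cost + r) + math.sqrt(r)) ** 2

if __name__ == "__main__":
    # Hypothetical values: empirical cost 0.18, KL divergence 5, delta 0.01.
    for n in (1000, 4000):
        print(n, pac_bound(0.18, 5.0, n, 0.01), quadratic_pac_bound(0.18, 5.0, n, 0.01))
```

In this toy setting both bounds shrink toward the empirical cost as the number of training environments grows, and which of the two is smaller depends on how large the bound itself is.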

For solving , we can either minimize (4) or (5). Intuitively, we would like to use the tighter one of the two. The following proposition addresses this concern by analytically identifying regimes where (5) is tighter than (4) and vice-versa.

###### Proposition 1

The proof of this proposition is detailed in Appendix -A.

Proposition 1 shows that (5) is tighter than (4) if and only if the upper bound of (5) is smaller than . On the other hand, we also have that (4) is tighter than (5) if and only if the upper bound of (5) is greater than . Hence, in our PAC-Bayes training algorithm we will use (4) when and (5) when .
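The existence of such a threshold can be motivated by a short calculation. The following is a sketch under assumed forms of the two bounds, not a substitute for the proof in Appendix -A: we write $C_S$ for the empirical cost and $R > 0$ for the regularizer (our shorthand), and assume that (4) reads $C_S + \sqrt{R}$ and (5) reads $(\sqrt{C_S + R} + \sqrt{R})^2$.

```latex
\begin{align*}
\big(\sqrt{C_S + R} + \sqrt{R}\big)^2 \le C_S + \sqrt{R}
  &\iff 2R + 2\sqrt{R\,(C_S + R)} \le \sqrt{R} \\
  &\iff \sqrt{R} + \sqrt{C_S + R} \le \tfrac{1}{2} \\
  &\iff \big(\sqrt{C_S + R} + \sqrt{R}\big)^2 \le \tfrac{1}{4}.
\end{align*}
```

The first step expands the square and cancels $C_S$, the second divides by $2\sqrt{R}$, and the third squares both (non-negative) sides. Under these assumed forms, the comparison between the two bounds reduces to comparing the quadratic bound itself against a constant.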

## IV Training

In this section we present our methodology for training vision-based planning policies that can provably perform well on novel environments using the PAC-Bayes Control framework. The PAC-Bayes framework permits the use of any prior distribution (independent of the training data) on the policy space. However, an uninformed choice of could result in vacuous bounds [8]. Therefore, obtaining strong PAC-Bayes bounds with efficient sample complexity calls for a good prior on the policy space. For DNNs, the choice of a good prior is often unintuitive. To remedy this, we split a given training dataset into two parts: and . We use the Evolutionary Strategies (ES) framework to train a prior using the training data in ; more details are provided in Section IV-A. Leveraging this prior, we perform PAC-Bayes optimization on the training data in ; further details on the PAC-Bayes optimization are presented in Section IV-B.

### IV-A Training a PAC-Bayes Prior with ES

We train the prior distribution on the policy space by minimizing the empirical cost on environments belonging to the set with cardinality . In particular, we choose to be a multivariate Gaussian distribution with a mean and a diagonal covariance . Let be the element-wise square-root of the diagonal of . Our training is performed using the class of RL algorithms known as Evolutionary Strategies (ES) [38]. ES provides multiple benefits in our setting: (a) The presence of a physics engine in the training loop prohibits backpropagation of analytical gradients. Since the policies in our setting are DNNs with hundreds and thousands of parameters, a naive finite-difference estimate of the gradient would be computationally prohibitive. ES permits gradient estimation with a significantly lower number of rollouts (albeit resulting in noisy estimates). (b) ES directly supplies us with a distribution on the policy space, thereby meshing well with the PAC-Bayes Control framework. (c) The ES gradient estimation can be conveniently parallelized in order to leverage cloud computing resources.

Adapting (3) for , we can express the gradient of the empirical cost with respect to (w.r.t.) as:

(7) |

Following the ES framework from [30, 38], the gradient of the empirical cost for any w.r.t. the mean is:

(8) |

where is the Hadamard division (element-wise division). Similarly, the gradient of the empirical cost w.r.t. is:

(9) |

where is the Hadamard product (element-wise product) and is a vector of ’s with dimension . Hence, (8) and (9) allow a Monte-Carlo estimation of the gradient. One can derive (8) and (9) using the diagonal covariance structure of and the reparameterization trick^{2}

Estimating the gradients directly from (8) and (9) leads to poor convergence of the cost function due to high variance in the gradient estimate. Indeed, this observation has been reported in the literature [30, 38]. To reduce the variance in the gradient estimate we perform antithetic sampling, as suggested in [30], i.e., for every we also evaluate the policy corresponding to for estimating the gradient. If we sample number of , then the Monte-Carlo estimate of the gradient with antithetic sampling will be:

(10) | ||||

(11) |
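To make the antithetic estimator concrete, the following is a minimal NumPy sketch. It assumes the standard score-function gradients for a diagonal Gaussian with parameters sampled as mean plus standard deviation times a standard normal perturbation, and it pairs each perturbation with its mirror image. The function name, argument names, and the toy quadratic cost used in the demo are illustrative assumptions; in the setting of this paper the cost would come from a physics-engine rollout on a training environment.

```python
import numpy as np

def antithetic_es_gradients(cost_fn, mu, sigma, num_samples, rng):
    """Monte-Carlo estimate of the gradients of E[cost(w)] w.r.t. mu and sigma,
    where w = mu + sigma * eps and eps ~ N(0, I), using mirrored samples."""
    grad_mu = np.zeros_like(mu)
    grad_sigma = np.zeros_like(sigma)
    for _ in range(num_samples):
        eps = rng.standard_normal(mu.shape)
        c_plus = cost_fn(mu + sigma * eps)   # rollout with perturbation +eps
        c_minus = cost_fn(mu - sigma * eps)  # mirrored rollout with -eps
        # The odd part of the cost drives the mean gradient,
        # the even part drives the standard-deviation gradient.
        grad_mu += 0.5 * (c_plus - c_minus) * eps / sigma
        grad_sigma += 0.5 * (c_plus + c_minus) * (eps * eps - 1.0) / sigma
    return grad_mu / num_samples, grad_sigma / num_samples

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([0.3, -0.2, 0.1])
    sigma = np.array([0.5, 0.4, 0.3])
    # Toy cost (an assumption for this demo): squared norm of the parameters.
    g_mu, g_sigma = antithetic_es_gradients(lambda w: float(np.sum(w ** 2)),
                                            mu, sigma, 50000, rng)
    print(g_mu, g_sigma)
```

For the toy quadratic cost the exact gradients are twice the mean and twice the standard deviation, which gives a quick sanity check on the estimator before plugging in an expensive simulator.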

Now we turn our attention towards the implementation aspects of ES. We exploit the high parallelizability of ES by splitting the environments into mini-batches , where is the number of CPU workers available to us. Each worker computes the gradient of the cost associated with each environment in using Algorithm 1 and returns their sum to the main process, where the gradients are averaged over all environments; see Algorithm 3 in Appendix A-B. These estimated gradients are then passed to a gradient-based optimizer for updating the distribution; we use the Adam optimizer [15]. We circumvent the non-negativeness constraint of the standard deviation by optimizing^{3}

### IV-B Training a PAC-Bayes Policy

This section details our approach for minimizing the PAC-Bayes upper-bounds in Theorem 1 to obtain provably generalizable posterior distributions on the policy-space. We begin by restricting our policy space to a finite set as follows:

Let be a prior^{4}

The primary benefit of working over a finite policy space is that it allows us to formulate the problem of minimizing the PAC-Bayes bounds (4) and (5) using convex optimization. As described in [19, Section 5.1], optimization of the PAC-Bayes bound (4) for can be achieved using a relative entropy program (REP); REPs are efficiently-solvable convex programs in which a linear functional of the decision variables is minimized subject to constraints that are linear or of the form [5, Section 1.1]. The remainder of this section will formulate the optimization of the bound (5) as a parametric REP. The resulting algorithm (Algorithm 2) finds a posterior distribution over that minimizes the PAC-Bayes bound (arbitrarily close to the global infimum).

Let be the policy-wise cost vector, each entry of which holds the average cost of running policy on the environments . Then, the empirical cost can be expressed linearly in as . Hence, the minimization of the PAC-Bayes bound (5) can be written as:

s.t. |

Introducing the scalars , , and , this optimization can be equivalently expressed as:^{5}

(12) | |||||

(13) | |||||

(14) | |||||

(15) |

This is an REP for a fixed and . Hence, we can perform a grid search on these scalars and solve for each fixed tuple of parameters . We can control the density of the grid on to get arbitrarily close to:

(16) |

The search space for is impractically large in (16). The following Proposition remedies this by shrinking the search space on and to compact intervals, thereby allowing for an efficient algorithm to solve (16).

###### Proposition 2

Our implementation of the PAC-Bayes optimization is detailed in Algorithm 2. In practice, we sweep across and for each fixed we perform a bisection search on (lines 13-14 in Algorithm 2). We make our bounds for in Proposition 2 tighter by replacing with the chosen in the expressions of and (lines 10-11 in Algorithm 2). Furthermore, we choose (line 6 in Algorithm 2). The REP that arises by fixing is solved using CVXPY [7] with the MOSEK solver [25]. Finally, we post-process the solution of Algorithm 2 to obtain a tighter PAC-Bayes bound by computing the KL-inverse between the empirical cost and using an REP; further details are provided in Appendix -B.
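The objective that the optimization in this section minimizes can be evaluated directly for any candidate posterior over a finite policy set. The sketch below does only that evaluation; it does not solve the REP. It assumes the quadratic form of bound (5) with regularizer (KL + log(2 sqrt(N)/delta)) / (2N), and all names (`cost_matrix`, `p`, `p0`) are hypothetical.

```python
import numpy as np

def discrete_kl(p, p0):
    """KL(p || p0) for distributions on a finite set; 0 log 0 is treated as 0."""
    p, p0 = np.asarray(p, dtype=float), np.asarray(p0, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / p0[mask])))

def quadratic_bound_finite(cost_matrix, p, p0, delta):
    """cost_matrix[i, j] is the cost in [0, 1] of policy i on training environment j."""
    cost_matrix = np.asarray(cost_matrix, dtype=float)
    p = np.asarray(p, dtype=float)
    n_envs = cost_matrix.shape[1]
    cost_vector = cost_matrix.mean(axis=1)  # average cost of each policy
    emp_cost = float(cost_vector @ p)       # empirical cost is linear in p
    r = (discrete_kl(p, p0) + np.log(2.0 * np.sqrt(n_envs) / delta)) / (2.0 * n_envs)
    return float((np.sqrt(emp_cost + r) + np.sqrt(r)) ** 2)

if __name__ == "__main__":
    # Three hypothetical policies with constant costs on 1000 environments.
    costs = np.repeat(np.array([[0.1], [0.5], [0.9]]), 1000, axis=1)
    prior = np.ones(3) / 3
    print(quadratic_bound_finite(costs, prior, prior, 0.01))
    print(quadratic_bound_finite(costs, [0.8, 0.1, 0.1], prior, 0.01))
```

Shifting posterior mass toward low-cost policies lowers the empirical cost but raises the KL term; the optimizer's job is to find the best balance between the two.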

## V Examples

In this section we use the algorithms developed in Section IV on two examples: (1) vision-based navigation of a UAV in novel obstacle fields, and (2) locomotion of a quadrupedal robot across novel rough terrains using proprioceptive and exteroceptive sensing. Through these examples, we demonstrate the ability of our approach to train vision-based DNN policies with strong generalization guarantees in a deep RL setting. The simulation and training are performed with PyBullet [6] and PyTorch [27] respectively. Our code is available at: https://github.com/irom-lab/PAC-Vision-Planning

### V-A Vision-Based UAV Navigation

In this example we train a quadrotor to navigate across an obstacle field without collision using depth maps from an onboard vision sensor; see Fig. 1 for an illustration.

Environment. For the sake of visualization, we made the “roof” of the obstacle course transparent in Fig. 1 and the videos. The true obstacle course is a (red) tunnel cluttered by cylindrical obstacles; see Fig. 4. The (unknown) distribution over environments is chosen by drawing obstacle radii and locations from a uniform distribution on and , respectively. The orientation is generated by drawing a quaternion using a normal distribution.

Motion Primitives. We work with a library of motion primitives for the quadrotor. The motion primitives are generated by connecting the initial position and the final desired position of the quadrotor by a smooth sigmoidal trajectory; Fig. 2 illustrates the sigmoidal trajectory for each primitive in our library. The robot moves along these trajectories at a constant speed and yaw and the roll-pitch are recovered by exploiting the differential flatness of the quadrotor [24, Section III].

Planning Policy. Our control policy maps a depth map to a score vector and then selects the motion primitive with the highest score; see the last layer in Fig. 3. A typical depth-map from the onboard sensor is visualized in Fig. 4. We model our policy as a DNN with a ResNet-like architecture (illustrated in Fig. 3). The policy processes the depth map along two parallel branches: the Depth Filter (which is fixed and has no parameters to learn) and the Residual Network (which is a DNN). Both branches generate a score for each primitive in which are summed to obtain the final aggregate score. The Depth Filter embeds the intuition that the robot can avoid collisions with obstacles by moving towards the “deepest” part of the depth map. We construct the Depth Filter by projecting the quadrotor’s position after executing a primitive on the depth map to identify the pixels where the quadrotor would end up; see the grid in Fig. 4 where each cell corresponds to the ending position of a motion primitive. The Depth Filter then applies a mask on the depth map that zeros out all pixels outside this grid and computes the average-depth of each cell in the grid which is treated as a primitive score from that branch of the policy. Note that this score is based only on the ending position of the quadrotor; the entire primitive trajectory when projected onto the depth map can lie outside the grid in Fig. 4. Therefore, to improve the policy’s performance, we train the Residual Network. Intuitively, this augments the scores from the Depth Filter branch by processing the entire depth map. The Residual Network is a DNN with parameters: .
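The Depth Filter branch and the final primitive selection described above can be sketched in a few lines of NumPy. This is an illustrative reading of the description, not the released implementation: the grid geometry arguments (`top`, `left`, `cell_h`, `cell_w`, `rows`, `cols`) are hypothetical, and the residual scores are passed in as a plain array rather than computed by a network.

```python
import numpy as np

def depth_filter_scores(depth_map, top, left, cell_h, cell_w, rows, cols):
    """Average depth of each grid cell, one cell per motion primitive.
    Pixels outside the grid are never read, which has the same effect on the
    cell averages as masking them out."""
    scores = np.empty(rows * cols)
    for r in range(rows):
        for c in range(cols):
            y0, x0 = top + r * cell_h, left + c * cell_w
            scores[r * cols + c] = depth_map[y0:y0 + cell_h, x0:x0 + cell_w].mean()
    return scores

def select_primitive(depth_map, residual_scores, **grid):
    """Sum the scores of both branches and return the best primitive index."""
    total = depth_filter_scores(depth_map, **grid) + residual_scores
    return int(np.argmax(total))

if __name__ == "__main__":
    grid = dict(top=2, left=2, cell_h=2, cell_w=2, rows=3, cols=3)
    depth = np.zeros((10, 10))
    depth[4:6, 6:8] = 5.0  # the "deepest" region lies in row 1, column 2
    print(select_primitive(depth, np.zeros(9), **grid))
```

With zero residual scores the policy reduces to the pure "fly toward the deepest cell" heuristic; the learned branch can override that choice when its scores are large enough.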

Training Summary. We choose the cost as where is the time at which the robot collides with an obstacle and is the total time horizon; in our example seconds. For the primitives in Fig. 2, the quadrotor moves monotonically forward. Hence, time is analogous to the forward displacement. The prior is trained using the method described in Section IV-A on an AWS EC2 g3.16xlarge instance with a run-time of 22 hours. PAC-Bayes optimization is performed with Algorithm 2 on a desktop with a 3.30 GHz i9-7900X CPU with 10 cores, 32 GB RAM, and a 12 GB NVIDIA Titan XP GPU. The bulk of the time in executing Algorithm 2 is spent on computing the costs in line 5 (i.e., 400 sec, 800 sec, and 1600 sec for the results in Table I from top to bottom), whereas solving (16) takes sec; Table III in the appendix provides the hyperparameters.

Results. The PAC-Bayes results are detailed in Table I. Here we choose (implying that the PAC-bounds hold with probability ) and vary ; see Appendix -B for details on the KL-inverse PAC-Bayes bound provided in Table I. We also perform an exhaustive simulation of the trained posterior on novel environments to empirically estimate the true cost. It can be observed that our PAC-Bayes bounds show close correspondence to the empirical estimate of the true cost. To facilitate the physical interpretation of the results in Table I, consider the last row with : according to our PAC-Bayes guarantee, with probability the quadrotor will (on average) get through of previously unseen obstacle courses. Videos of representative trials on the test environments can be found at: https://youtu.be/03qq4sLU34o

| # Environments N | PAC-Bayes Cost: Bound (4) | PAC-Bayes Cost: Bound (5) | PAC-Bayes Cost: KL Inv. | True Cost (Estimate) |
|---|---|---|---|---|
| 1000 | 26.02% | - | 24.89% | 18.34% |
| 2000 | - | 22.99% | 22.28% | 18.42% |
| 4000 | - | 21.55% | 21.11% | 18.43% |

### V-B Quadrupedal Locomotion on Uneven Terrain

Environment. In this example, we train the quadrupedal robot Minitaur [14] to traverse an uneven terrain characterized by slopes uniformly sampled between to ; see Fig. 1 for an illustration of a representative environment. We use the minitaur_gym_env in PyBullet to simulate the full nonlinear/hybrid dynamics of the robot. The objective here is to train a posterior distribution on policies that enables the robot to cross a finish-line (depicted in red in Fig. 1) situated m in the initial heading direction .

Motion Primitives. We use the sine controller that is available in the minitaur_gym_env as well as Minitaur’s SDK developed by Ghost Robotics. The sine controller generates desired motor angles based on a sinusoidal function that depends on the stepping amplitudes , , steering amplitude , and angular velocity as follows:

These four desired angles are communicated to the eight motors of the Minitaur (two per leg) in the following order: . Hence, our primitives are characterized by the scalars , and rad/s, rendering the library of motion primitives uncountable but compact. Each primitive is executed for a time-horizon of 0.5 seconds as it roughly corresponds to one step for the robot.

Planning Policy. The policy selects a primitive based on the depth map and the proprioceptive feedback (8 motor angles, 8 motor angular velocities, and the six-dimensional position and orientation of the robot’s torso). A visualization of the depth map used by the policy can be found in Fig. 5. The policy we train is a DNN with parameters: The depth map is passed through a convolutional layer and the features are concatenated with the proprioceptive feedback and processed as follows: . Let the 4 outputs be denoted by . Then, the primitive is assigned as:

Intuitively, the transformation above from to ensures that the robot has a minimum forward speed and a maximum steering speed.

Training Summary. The cost we use is where is the robot’s displacement at the end of the rollout along the initial heading direction and is the distance to the finish-line from the robot’s initial position along ; m for the results in this paper. A rollout is considered complete when the robot crosses the finish-line or if the rollout time exceeds seconds. All the training in this example is performed on a desktop with a 3.30 GHz i9-7900X CPU with 10 cores, 32 GB RAM, and a 12 GB NVIDIA Titan XP GPU. As before, we train the prior using the method described in Section IV-A, the execution of which takes 2 hours. PAC-Bayes optimization is performed using Algorithm 2. The execution of line 5 in Algorithm 2 takes 65 min, 130 min, and 260 min for the results in Table II from top to bottom, respectively, whereas solving (16) takes 1 sec. Table III in the Appendix details the relevant hyperparameters.

Results. The PAC-Bayes results are detailed in Table II where and the number of environments is varied; see Appendix -B for details on the KL-inverse PAC-Bayes bound provided in Table II. The PAC-Bayes cost can be interpreted in a manner similar to the quadrotor; e.g., for PAC-Bayes optimization with , with probability , the quadruped will (on average) traverse () of the previously unseen environments. Videos of representative trials on the test environments can be found at: https://youtu.be/03qq4sLU34o

| # Environments N | PAC-Bayes Cost: Bound (4) | PAC-Bayes Cost: Bound (5) | PAC-Bayes Cost: KL Inv. | True Cost (Estimate) |
|---|---|---|---|---|
| 500 | 25.65% | - | 23.75% | 16.89% |
| 1000 | - | 22.92% | 21.60% | 16.78% |
| 2000 | - | 21.40% | 20.23% | 16.80% |

## VI Conclusions

We presented a deep reinforcement learning approach for synthesizing vision-based planners with certificates of performance on novel environments. We achieved this by directly optimizing a PAC-Bayes generalization bound on the average cost of the policies over all environments. To obtain strong generalization bounds, we devised a two-step training pipeline. First, we use ES to train a good prior distribution on the space of policies. Then, we use this prior in a PAC-Bayes optimization to find a posterior that minimizes the PAC-Bayes bound. The PAC-Bayes optimization is formulated as a parametric REP that can be solved efficiently. Our examples demonstrate the ability of our approach to train DNN policies with strong generalization guarantees.

Future Work. There are a number of exciting future directions that arise from this work. We believe that our approach can be extended to provide PAC-Bayes certificates of generalization for long-horizon vision-based motion plans. In particular, we are exploring the augmentation of our policy with a generative network that can predict the future visual observations conditioned on the primitive to be executed. Another direction we are excited to pursue is training vision-based policies that are robust to unknown disturbances (e.g., wind gusts) which are not a part of the training data. Specifically, we hope to address this challenge by bridging the approach in this paper with the model-based robust planning approaches in the authors’ previous work [21, 35]. Finally, we are also working towards a hardware implementation of our approach on a UAV and the Minitaur. We hope to leverage recent advances in sim-to-real transfer to minimize training on the actual hardware.

### -A Proofs

Proof of Proposition 1.
We will begin with (i)^{6}

(17) | ||||

(18) | ||||

(19) | ||||

(20) | ||||

(21) |

where the step from (18) to (19) follows by multiplying both sides with , noting that , and adding ; and the step from (19) to (20) follows by noting that .

The statement (ii) holds because it is the contrapositive of (i). Finally, the proof of (iii) follows by noting that for both (i) and (ii) hold simultaneously, i.e., as well as .

Proof of Proposition 2. The bound follows by noting that for any outside that interval the constraints (14) and (15) will be violated.

Let be the space of probability distributions on . To get the bound on first observe that for all :

(22) |

Now, using the above and in (13), we get:

Furthermore, it can be verified that:

### -B PAC-Bayes Bound with KL inverse [20, 19]

Consider the following PAC-Bayes bound:

###### Theorem 2 ([23, 22])

For any , with probability at least over samples , the following inequalities hold:

(23) |

Here, the KL-inverse is defined as follows:

(24) |

The PAC-Bayes bounds (4) and (5) are derived from Theorem 2; hence, (23) is tighter than both of them. However, computing the gradient of the KL inverse analytically is challenging, necessitating the use of (4) and (5) for the PAC-Bayes optimization. Nevertheless, after obtaining an “optimal” posterior based on (4) or (5), we can obtain a tighter PAC-Bayes cost using Theorem 2 by leveraging a key observation: the KL inverse is readily expressed as the optimal value of a simple Relative Entropy Program. In particular, the expression for the KL inverse in (24) corresponds to an optimization problem with a (scalar) decision variable , a linear cost function (i.e., ), linear inequality constraints (i.e., ), and a constraint on the KL divergence between the decision variable and the constant . We can thus compute the KL inverse exactly (up to numerical tolerances) using convex optimization (e.g., interior point methods [5]).
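As a rough illustration of how a KL inverse can be evaluated numerically, the sketch below assumes the standard definition, namely the largest q in [0, 1] such that the KL divergence between Bernoulli(p) and Bernoulli(q) is at most c. It uses plain bisection rather than the Relative Entropy Program described above; the two should agree because the feasible set is an interval. The function names `bernoulli_kl` and `kl_inverse` are hypothetical and are not taken from the paper's code.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    eps = 1e-12  # clamp q away from {0, 1} so the logarithms stay finite
    q = min(max(q, eps), 1.0 - eps)
    kl = 0.0
    if p > 0.0:
        kl += p * math.log(p / q)
    if p < 1.0:
        kl += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return kl


def kl_inverse(p: float, c: float, tol: float = 1e-10) -> float:
    """Largest q in [p, 1] with KL(p || q) <= c (assumed definition), by bisection."""
    if not 0.0 <= p <= 1.0 or c < 0.0:
        raise ValueError("need 0 <= p <= 1 and c >= 0")
    lo, hi = p, 1.0
    # KL(p || q) is non-decreasing in q on [p, 1], so bisection is valid here.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(p, mid) <= c:
            lo = mid  # mid is feasible, so the supremum lies at or above it
        else:
            hi = mid  # mid violates the KL constraint
    return lo
```

In this sketch, `p` would play the role of the empirical training cost of the posterior and `c` the complexity term of the bound; by Pinsker's inequality the returned value never exceeds `p + sqrt(c / 2)`.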

## Appendix A Derivation of Optimization Problem for Finite Policy Spaces

### A-A Hyperparameters

The hyperparameters used for the examples in this paper are detailed in Table III.

### A-B Complete ES Algorithm

A complete implementation of the ES algorithm used in the paper for training the prior is supplied in Algorithm 3.
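For readers without Algorithm 3 at hand, the following is a minimal sketch of a single ES update, not a reproduction of that algorithm. It assumes a diagonal Gaussian over policy parameters, antithetic sampling, and a plain gradient step on the mean only; updating the standard deviation and using an adaptive optimizer are deliberately omitted. The names `es_step`, `pop`, and `lr` are hypothetical.

```python
import random


def es_step(mu, sigma, cost, rng, pop=20, lr=0.05):
    """One ES update of the mean of a diagonal Gaussian over parameters.

    mu, sigma: lists of per-parameter means and standard deviations.
    cost: callable mapping a parameter list to a scalar cost (to be minimized).
    rng: a random.Random instance, passed in for reproducibility.
    """
    dim = len(mu)
    grad = [0.0] * dim
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Antithetic pair: evaluate the cost at mu + sigma*eps and mu - sigma*eps.
        plus = cost([m + s * e for m, s, e in zip(mu, sigma, eps)])
        minus = cost([m - s * e for m, s, e in zip(mu, sigma, eps)])
        for i in range(dim):
            # Monte Carlo estimate of d/dmu_i of the expected cost.
            grad[i] += (plus - minus) * eps[i] / (2.0 * sigma[i] * pop)
    # Gradient descent on the mean; sigma is held fixed in this sketch.
    return [m - lr * g for m, g in zip(mu, grad)]
```

A toy usage, with a quadratic cost standing in for the rollout cost of a policy:

```python
rng = random.Random(0)
target = [1.0, -2.0]
quad = lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target))
mu = [0.0, 0.0]
for _ in range(200):
    mu = es_step(mu, [0.1, 0.1], quad, rng)
```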

### Footnotes

- For notational convenience we are dropping the dependence on , .
- where
- The notation is overloaded to mean element-wise .
- As before, it is understood that is a diagonal covariance matrix with its diagonal entries being the element-wise square of .
- If we have an infeasible , then we assume that .
- For notational convenience we drop the dependence of , on .

### References

- (2007) Grasp planning in complex scenes. In Proceedings of IEEE-RAS International Conference on Humanoid Robots, pp. 42–48. Cited by: §I-A.
- (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §I-A, §I.
- (1999) Sequential composition of dynamically dexterous robot behaviors. The International Journal of Robotics Research 18 (6), pp. 534–555. Cited by: §I-A.
- (2019) Scenario optimization for MPC. In Handbook of Model Predictive Control, S. V. Raković and W. S. Levine (Eds.), pp. 445–463. ISBN 978-3-319-77489-3. Cited by: §I-A.
- (2017) Relative entropy optimization and its applications. Mathematical Programming 161 (1-2), pp. 1–32. Cited by: §-B, §I, §IV-B.
- (2018) PyBullet, a Python module for physics simulation for games, robotics and machine learning. Cited by: §V.
- (2016) CVXPY: a Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17 (83), pp. 1–5. Cited by: §IV-B.
- (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008. Cited by: §I-A, §IV.
- (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568. Cited by: §I-A.
- (2005) Maneuver-based motion planning for nonlinear systems with symmetries. IEEE Transactions on Robotics 21 (6), pp. 1077–1091. Cited by: §I-A.
- (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2450–2462. Cited by: §I-A.
- (2019) Robot motion planning in learned latent spaces. IEEE Robotics and Automation Letters 4 (3), pp. 2407–2414. Cited by: §I-A, §I.
- (2015) Probabilistically valid stochastic extensions of deterministic models for systems with uncertainty. The International Journal of Robotics Research 34 (10), pp. 1278–1295. Cited by: §I-A.
- (2016) Design principles for a family of direct-drive legged robots. IEEE Robotics and Automation Letters 1 (2), pp. 900–907. Cited by: §V-B.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-A.
- (2001) Randomized kinodynamic planning. The International Journal of Robotics Research 20 (5), pp. 378–400. Cited by: §I.
- (2019) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953. Cited by: §I-A.
- (2009) Standing balance control using a trajectory library. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3031–3036. Cited by: §I-A.
- (2019) PAC-Bayes Control: learning policies that provably generalize to novel environments. arXiv preprint arXiv:1806.04225. Cited by: §-B, §I-A, §III, §IV-B.
- (2018) PAC-Bayes Control: synthesizing controllers that provably generalize to novel environments. In Proceedings of the Conference on Robot Learning, Cited by: §-B, §I-A, §III, §III.
- (2017) Funnel libraries for real-time robust feedback motion planning. The International Journal of Robotics Research 36 (8), pp. 947–982. Cited by: §I-A, §I, §VI.
- (2004) A note on the PAC Bayesian theorem. arXiv preprint cs/0411099. Cited by: §II, Theorem 2.
- (1999) Some PAC-Bayesian theorems. Machine Learning 37 (3), pp. 355–363. Cited by: §I, §I, §II, Theorem 2.
- (2011) Minimum snap trajectory generation and control for quadrotors. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2520–2525. Cited by: §V-A.
- (2019) MOSEK Fusion API for Python 9.0.84(beta). Cited by: §IV-B.
- (2016) Composing limit cycles for motion planning of 3D bipedal walkers. In Proceedings of the IEEE Conference on Decision and Control, pp. 6368–6374. Cited by: §I-A, §I.
- (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §V.
- (2019) Motion planning networks. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2118–2124. Cited by: §I-A.
- (2019) PAC-Bayes with backprop. arXiv preprint arXiv:1908.07380. Cited by: §I-A, §III.
- (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §IV-A, §IV-A, §IV-A.
- (2008) Learning maneuver dictionaries for ground robot planning. In Proceedings of the International Symposium on Robotics (ISR), Cited by: §I-A, §I.
- (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press. Cited by: §I-A.
- (2018) Universal planning networks. arXiv preprint arXiv:1804.00645. Cited by: §I-A, §I.
- (2010) LQR-trees: feedback motion planning via sums-of-squares verification. The International Journal of Robotics Research 29 (8), pp. 1038–1052. Cited by: §I-A.
- (2019) Switched systems with multiple equilibria under disturbances: boundedness and practical stability. IEEE Transactions on Automatic Control. Cited by: §I-A, §VI.
- (2001) Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica 37 (10), pp. 1515–1528. Cited by: §I-A.
- (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754. Cited by: §I-A.
- (2014) Natural evolution strategies. The Journal of Machine Learning Research 15 (1), pp. 949–980. Cited by: §I-A, §I, §IV-A, §IV-A, §IV-A, §IV-A.
- (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 3357–3364. Cited by: §I-A.