# EPOpt: Learning Robust Neural Network Policies Using Model Ensembles

## Abstract

Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators like deep neural networks. Model-based methods where the real-world target domain is approximated using a simulated source domain provide an avenue to tackle the above challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain pose a challenge for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains and a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from target domain and approximate Bayesian methods, to progressively make it a better approximation. Thus, learning on a model ensemble, along with source domain adaptation, provides the benefit of both robustness and learning/adaptation.

## 1 Introduction

Reinforcement learning with powerful function approximators like deep neural networks (deep RL) has recently demonstrated remarkable success in a wide range of tasks like games [19], simulated control problems [16], and graphics [23]. However, high sample complexity is a major barrier to directly applying model-free deep RL methods to physical control tasks. Model-free algorithms like Q-learning, actor-critic, and policy gradients are known to suffer from long learning times [13], which is compounded when used in conjunction with expressive function approximators like deep neural networks (DNNs). The challenge of gathering samples from the real world is further exacerbated by issues of safety for the agent and environment, since sampling with partially learned policies could be unstable [10]. Thus, model-free deep RL methods often require a prohibitively large number of potentially dangerous samples for physical control tasks.

Model-based methods, where the real-world target domain is approximated with a simulated source domain, provide an avenue to tackle the above challenges by learning policies using simulated data. The principal challenge with simulated training is the systematic discrepancy between source and target domains, and therefore, methods that compensate for systematic discrepancies (modeling errors) are needed to transfer results from simulations to real world using RL. We show that the impact of such discrepancies can be mitigated through two key ideas: (1) training on an ensemble of models in an adversarial fashion to learn policies that are robust to parametric model errors, as well as to unmodeled effects; and (2) adaptation of the source domain ensemble using data from the target domain to progressively make it a better approximation. This can be viewed either as an instance of model-based Bayesian RL [11]; or as transfer learning from a collection of simulated source domains to a real-world target domain [32]. While a number of model-free RL algorithms have been proposed (see, e.g., [7] for a survey), their high sample complexity demands use of a simulator, effectively making them model-based. We show in our experiments that such methods learn policies which are highly optimized for the specific models used in the simulator, but are brittle under model mismatch. This is not surprising, since deep networks are remarkably proficient at exploiting any systematic regularities in a simulator. Addressing robustness of DNN-policies is particularly important to transfer their success from simulated tasks to physical systems.

In this paper, we propose the Ensemble Policy Optimization (EPOpt) algorithm for finding policies that are robust to model mismatch. In line with model-based Bayesian RL, we learn a policy for the target domain by alternating between two phases: (i) given a source (model) distribution (i.e. ensemble of models), find a robust policy that is competent for the whole distribution; (ii) gather data from the target domain using said robust policy, and adapt the source distribution. EPOpt uses an ensemble of models sampled from the source distribution, and a form of adversarial training to learn robust policies that generalize to a broad range of models. By robust, we mean insensitivity to parametric model errors and broadly competent performance for *direct-transfer* (also referred to as *jumpstart*, as in [32]). Direct-transfer performance refers to the average initial performance (return) in the target domain, without any direct training on the target domain. By adversarial training, we mean that model instances on which the policy performs poorly in the source distribution are sampled more often in order to encourage learning of policies that perform well for a wide range of model instances. This is in contrast to methods which learn policies that are highly optimized for specific model instances but brittle under model perturbations. In our experiments, we did not observe significant loss in performance by requiring the policy to work on multiple models (for example, through adopting a more conservative strategy). Further, we show that policies learned using EPOpt are robust even to effects not modeled in the source domain. Such unmodeled effects are a major issue when transferring from simulation to the real world. For the model adaptation step (ii), we present a simple method using approximate Bayesian updates, which progressively makes the source distribution a better approximation of the target domain.

We evaluate the proposed methods on the hopper (12 dimensional state space; 3 dimensional action space) and half-cheetah (18 dimensional state space; 6 dimensional action space) benchmarks in MuJoCo. Our experimental results suggest that adversarial training on model ensembles produces robust policies which generalize better than policies trained on a single, maximum-likelihood model (of the source distribution) alone.

## 2 Problem Formulation

We consider parametrized Markov Decision Processes (MDPs), which are tuples of the form $\mathcal{M}(p) \equiv \langle \mathcal{S}, \mathcal{A}, \mathcal{T}_p, \mathcal{R}_p, \gamma, S_{0,p} \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are (continuous) states and actions respectively; $\mathcal{T}_p$, $\mathcal{R}_p$, and $S_{0,p}$ are the state transition, reward function, and initial state distribution respectively, all parametrized by $p$; and $\gamma$ is the discount factor. Thus, we consider a set of MDPs with the same state and action spaces. Each MDP in this set could potentially have different transition functions, rewards, and initial state distributions. We use transition functions of the form $S_{t+1} \equiv \mathcal{T}_p(s_t, a_t)$, where $\mathcal{T}_p$ is a random process and $S_{t+1}$ is a random variable.

We distinguish between source and target MDPs using $\mathcal{M}(p)$ and $\mathcal{W}$ respectively. We also refer to $\mathcal{M}$ and $\mathcal{W}$ as source and target domains respectively, as is common in the transfer learning set-up. Our objective is to learn the optimal policy for $\mathcal{W}$; and to do so, we have access to $\mathcal{M}(p)$. We assume that we have a distribution ($\mathcal{D}$) over the source domains (MDPs) generated by a distribution over the parameters $P \equiv \mathcal{P}(p)$ that captures our subjective belief about the parameters of $\mathcal{W}$. Let $\mathcal{P}$ be parametrized by $\psi$ (e.g. mean, standard deviation). For example, $\mathcal{W}$ could be a hopping task with reward proportional to hopping velocity, where falling down corresponds to a terminal state. For this task, $p$ could correspond to parameters like torso mass, ground friction, and damping in joints, all of which affect the dynamics. Ideally, we would like the target domain to be in the model class, i.e. $\{\exists\, p \mid \mathcal{M}(p) = \mathcal{W}\}$. However, in practice, there are likely to be unmodeled effects, and we analyze this setting in our experiments. We wish to learn a policy $\pi^*_\theta(s)$ that performs well for all $\mathcal{M} \sim \mathcal{D}$. Note that this robust policy does not have an explicit dependence on $p$, and we require it to perform well without knowledge of $p$.
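As a concrete illustration of the source distribution described above, the following minimal sketch draws model parameters from independent truncated Gaussians, one per physical parameter. The class name, the truncation by rejection sampling, and every numerical value are assumptions made for illustration; they are not taken from the paper's tables.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TruncatedGaussian:
    """One entry of the distribution parameters: a Gaussian restricted to [low, high]."""
    mean: float
    std: float
    low: float
    high: float

    def sample(self, rng: random.Random) -> float:
        while True:  # rejection sampling keeps the draw inside the allowed range
            x = rng.gauss(self.mean, self.std)
            if self.low <= x <= self.high:
                return x


def sample_model_parameters(psi: dict, rng: random.Random) -> dict:
    """Draw one parameter vector, with an independent draw per physical parameter."""
    return {name: dist.sample(rng) for name, dist in psi.items()}


# Hypothetical source distribution; the numbers are placeholders.
PSI = {
    "torso_mass": TruncatedGaussian(mean=6.0, std=1.5, low=3.0, high=9.0),
    "ground_friction": TruncatedGaussian(mean=2.0, std=0.25, low=1.5, high=2.5),
}

rng = random.Random(0)
ensemble = [sample_model_parameters(PSI, rng) for _ in range(100)]
```

Each dictionary in `ensemble` would then be used to configure one simulated source MDP. The policy itself never receives these values as input.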

## 3 Learning protocol and EPOpt algorithm

We follow the round-based learning protocol of Bayesian model-based RL. We use the term *rounds* when interacting with the target domain, and *episode* when performing rollouts with the simulator. In each round, we interact with the target domain after computing the robust policy on the current (i.e. posterior) simulated source distribution. Following this, we update the source distribution using data from the target domain collected by executing the robust policy. Thus, in round $i$, we update two sets of parameters: $\theta_i$, the parameters of the robust policy (neural network); and $\psi_i$, the parameters of the source distribution. The two key steps in this procedure are finding a robust policy given a source distribution, and updating the source distribution using data from the target domain. In this section, we present our approach for both of these steps.

### 3.1 Robust policy search

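We introduce the EPOpt algorithm for finding a robust policy using the source distribution. EPOpt is a policy gradient based meta-algorithm which uses batch policy optimization methods as a subroutine. Batch policy optimization algorithms [38] collect a batch of trajectories by rolling out the current policy, and use the trajectories to make a policy update. The basic structure of EPOpt is to sample a collection of models from the source distribution, sample trajectories from each of these models, and make a gradient update based on a subset of sampled trajectories. We first define evaluation metrics for the parametrized policy, $\pi_\theta$:

$$
\begin{aligned}
\eta_{\mathcal{M}}(\theta, p) &= \mathbb{E}_{\tilde{\tau}}\left[\sum_{t=0}^{T-1} \gamma^t r_t(s_t, a_t) \,\Big|\, p\right], \\
\eta_{\mathcal{D}}(\theta) &= \mathbb{E}_{p \sim \mathcal{P}}\left[\eta_{\mathcal{M}}(\theta, p)\right] = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T-1} \gamma^t r_t(s_t, a_t)\right].
\end{aligned}
\tag{1}
$$

In (1), $\eta_{\mathcal{M}}(\theta, p)$ is the evaluation of $\pi_\theta$ on the model $\mathcal{M}(p)$, with $\tilde{\tau}$ being trajectories generated by $\mathcal{M}(p)$ and $\pi_\theta$: $\tilde{\tau} = \{s_t, a_t, r_t\}_{t=0}^{T-1}$ where $s_{t+1} \sim \mathcal{T}_p(s_t, a_t)$, $s_0 \sim S_{0,p}$, $r_t \sim \mathcal{R}_p(s_t, a_t)$, and $a_t \sim \pi_\theta(s_t)$. Similarly, $\eta_{\mathcal{D}}(\theta)$ is the evaluation of $\pi_\theta$ over the source domain distribution. The corresponding expectation is over trajectories $\tau$ generated by $\mathcal{D}$ and $\pi_\theta$: $\tau = \{s_t, a_t, r_t\}_{t=0}^{T-1}$, where $s_{t+1} \sim \mathcal{T}_{p_t}(s_t, a_t)$, $p_{t+1} = p_t$, $s_0 \sim S_{0,p_0}$, $r_t \sim \mathcal{R}_{p_t}(s_t, a_t)$, $a_t \sim \pi_\theta(s_t)$, and $p_0 \sim \mathcal{P}$. With this modified notation of trajectories, batch policy optimization can be invoked for policy search.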

Optimizing $\eta_{\mathcal{D}}$ allows us to learn a policy that performs best in expectation over models in the source domain distribution. However, this does not necessarily lead to a robust policy, since there could be high variability in performance for different models in the distribution. To explicitly seek a robust policy, we use a softer version of the max-min objective suggested in robust control, and optimize for the conditional value at risk (CVaR) [31]:

$$
\max_{\theta, y} \int_{\mathcal{F}(\theta)} \eta_{\mathcal{M}}(\theta, p)\, \mathcal{P}(p)\, dp \quad \text{s.t.} \quad \mathbb{P}\left(\eta_{\mathcal{M}}(\theta, P) \le y\right) = \epsilon,
\tag{2}
$$

where $\mathcal{F}(\theta) = \{p \mid \eta_{\mathcal{M}}(\theta, p) \le y\}$ is the set of parameters corresponding to models that produce the worst $\epsilon$ percentile of returns, and provides the limit for the integral; $\eta_{\mathcal{M}}(\theta, P)$ is the random variable of returns, which is induced by the distribution over model parameters; and $\epsilon$ is a hyperparameter which governs the level of relaxation from the max-min objective. The interpretation is that (2) maximizes the expected return for the worst $\epsilon$-percentile of MDPs in the source domain distribution. We adapt the previous policy gradient formulation to approximately optimize the objective in (2). The resulting algorithm, which we call EPOpt-$\epsilon$, generalizes learning a policy using an ensemble of source MDPs which are sampled from a source domain distribution.
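The CVaR objective can be made concrete with a sample-based estimate: sort the returns and average the worst fraction of them. The sketch below is a generic empirical estimator under a nearest-rank convention (an assumption on our part); the function name and the example returns are invented for illustration.

```python
import math


def empirical_cvar(returns, epsilon):
    """Mean of the worst ceil(epsilon * N) returns (the lower tail)."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    ordered = sorted(returns)
    k = max(1, math.ceil(epsilon * len(ordered)))
    return sum(ordered[:k]) / k


returns = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
worst_case = empirical_cvar(returns, 0.2)   # averages the two lowest returns
average = empirical_cvar(returns, 1.0)      # recovers the plain mean
```

With the full fraction the estimator reduces to the expected return, and as the fraction shrinks it approaches the minimum, which mirrors the relaxation from the max-min objective described above.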

In Algorithm 1, $R(\tau_k) \equiv \sum_{t=0}^{T-1} \gamma^t r_t(s_t, a_t)$ denotes the discounted return obtained in trajectory sample $\tau_k$. In line 7, we compute the $\epsilon$ percentile value of returns from the $N$ trajectories. In line 8, we find the subset of sampled trajectories which have returns lower than $Q_\epsilon$. Line 9 calls one step of an underlying batch policy optimization subroutine on the subset of trajectories from line 8. For the CVaR objective, it is important to use a good baseline for the value function. [31] show that without a baseline, the resulting policy gradient is biased and not consistent. We use a linear function as the baseline with a time varying feature vector to approximate the value function, similar to [7]. The parameters of the baseline are estimated using only the subset of trajectories with return less than $Q_\epsilon$. We found that this approach led to empirically good results.
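The steps described above (sample models, roll out the policy, compute the percentile of returns, keep the worst trajectories, take one policy optimization step) can be sketched as the loop below. This is a structural sketch only, not the authors' implementation: the one-step toy problem, the function names, and the analytic gradient step that stands in for the batch policy optimization subroutine are all assumptions made so that the example runs on its own.

```python
import math
import random


def epopt(theta, sample_model, rollout, batch_pol_opt, n_iter, n_traj, epsilon, rng):
    """Structure of the EPOpt loop; the three callables are supplied by the user."""
    for _ in range(n_iter):
        # Sample N models and roll out the current policy once on each.
        trajectories = [rollout(theta, sample_model(rng), rng) for _ in range(n_traj)]
        # epsilon-percentile of the returns (nearest-rank convention).
        returns = sorted(t["return"] for t in trajectories)
        q_eps = returns[max(1, math.ceil(epsilon * n_traj)) - 1]
        # Keep only the trajectories at or below that percentile.
        subset = [t for t in trajectories if t["return"] <= q_eps]
        theta = batch_pol_opt(theta, subset)
    return theta


# --- Toy stand-ins (hypothetical, for illustration only) ---
def sample_model(rng):
    return rng.expovariate(1.0)  # skewed scalar "physics" parameter p


def rollout(theta, p, rng):
    # One-step episode: the action is theta, and the return penalizes mismatch with p.
    return {"p": p, "return": -(theta - p) ** 2}


def batch_pol_opt(theta, subset, lr=0.05):
    # Analytic gradient ascent on the mean return of the subset (stands in for TRPO).
    grad = sum(-2.0 * (theta - t["p"]) for t in subset) / len(subset)
    return theta + lr * grad


theta_avg = epopt(0.0, sample_model, rollout, batch_pol_opt, 300, 200, 1.0, random.Random(0))
theta_cvar = epopt(0.0, sample_model, rollout, batch_pol_opt, 300, 200, 0.1, random.Random(0))
```

In this toy, training on all trajectories settles near the mean of the parameter distribution, while training on the worst 10% shifts the solution toward the heavy tail, trading some average return for better worst-case return.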

For small values of $\epsilon$, we observed that using the sub-sampling step from the beginning led to unstable learning. Policy gradient methods adjust the parameters of the policy to increase the probability of trajectories with high returns and reduce the probability of poor trajectories. Due to the sub-sampling step, EPOpt-$\epsilon$ places more emphasis on penalizing poor trajectories. This might constrain the initial exploration needed to find good trajectories. Thus, we initially use a setting of $\epsilon = 1$ for a few iterations before setting $\epsilon$ to the desired value. This corresponds to exploring initially to find promising trajectories, and then rapidly reducing the probability of trajectories that do not generalize.
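The warm-up described above amounts to a two-phase schedule for the sub-sampling fraction. A minimal sketch follows; the function name is hypothetical and the target value of 0.1 is only an example setting.

```python
def epsilon_schedule(iteration, warmup_iters, target_epsilon):
    """Return 1.0 (no sub-sampling) during warm-up, then the target value."""
    if not 0.0 < target_epsilon <= 1.0:
        raise ValueError("target_epsilon must lie in (0, 1]")
    return 1.0 if iteration < warmup_iters else target_epsilon


# Example setting: 100 warm-up iterations, then sub-sample the worst 10%.
schedule = [epsilon_schedule(i, warmup_iters=100, target_epsilon=0.1) for i in range(200)]
```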

### 3.2 Adapting the source domain distribution

In line with model-based Bayesian RL, we can adapt the ensemble distribution after observing trajectory data from the target domain. The Bayesian update can be written as:

$$
\mathbb{P}(P \mid \tau_k) = \frac{1}{Z} \times \mathbb{P}(\tau_k \mid P) \times \mathbb{P}(P) = \frac{1}{Z} \times \prod_t \mathbb{P}\left(S_{t+1} = s_{t+1}^{(k)} \mid s_t^{(k)}, a_t^{(k)}, p\right) \times \mathbb{P}(P = p),
$$

where $Z$ is the partition function (normalization) required to make the probabilities sum to 1, $S_{t+1}$ is the random variable representing the next state, and $\left(s_t^{(k)}, a_t^{(k)}, s_{t+1}^{(k)}\right)_{t=0}^{T}$ are data observed along trajectory $\tau_k$. We try to explain the target trajectory using the stochasticity in the state-transition function, which also models sensor errors. This provides the following expression for the likelihood:

$$
\mathbb{P}(S_{t+1} \mid s_t, a_t, p) \equiv \mathcal{T}_p(s_t, a_t).
$$

We follow a sampling based approach to calculate the posterior, by sampling a set of model parameters $\{p_1, p_2, \ldots, p_M\}$ from a sampling distribution, $\mathbb{P}_S(p_i)$. Consequently, using Bayes rule and importance sampling, we have:

$$
\mathbb{P}(p_i \mid \tau_k) \propto \mathcal{L}(\tau_k \mid p_i) \times \frac{\mathbb{P}_P(p_i)}{\mathbb{P}_S(p_i)},
$$

where $\mathbb{P}_P(p_i)$ is the probability of drawing $p_i$ from the prior distribution; and $\mathcal{L}(\tau_k \mid p_i)$ is the likelihood of generating the observed trajectory with model parameters $p_i$. The weighted samples from the posterior can be used to estimate a parametric model, as we do in this paper. Alternatively, one could approximate the continuous probability distribution using discrete weighted samples, as in particle filters. In cases where the prior has very low probability density in certain parts of the parameter space, it might be advantageous to choose a sampling distribution different from the prior. The likelihood can be factored using the Markov property as: $\mathcal{L}(\tau_k \mid p_i) = \prod_t \mathbb{P}\left(S_{t+1} = s_{t+1}^{(k)} \mid s_t^{(k)}, a_t^{(k)}, p_i\right)$. This simple model adaptation rule allows us to illustrate the utility of EPOpt for robust policy search, as well as its integration with model adaptation to learn policies in cases where the target model could be very different from the initially assumed distribution.
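To illustrate the importance-weighted update, the sketch below adapts a belief over a single scalar parameter. Everything specific here is an assumption for illustration: the toy dynamics (next state equals state plus parameter times action plus Gaussian noise), the uniform sampling distribution, the Gaussian prior, and the final Gaussian fit to the weighted samples. Weights are computed in log space to avoid underflow from the product of per-step likelihoods.

```python
import math
import random


def gaussian_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def trajectory_loglik(traj, p, noise_std):
    """Markov factorization: sum of per-step log-likelihoods under s' = s + p*a + noise."""
    return sum(gaussian_logpdf(s_next, s + p * a, noise_std) for s, a, s_next in traj)


def posterior_from_samples(traj, samples, prior_logpdf, sampler_logpdf, noise_std):
    """Self-normalized importance weights, then a Gaussian fit to the weighted samples."""
    log_w = [trajectory_loglik(traj, p, noise_std) + prior_logpdf(p) - sampler_logpdf(p)
             for p in samples]
    shift = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]
    mean = sum(wi * p for wi, p in zip(w, samples))
    var = sum(wi * (p - mean) ** 2 for wi, p in zip(w, samples))
    return mean, math.sqrt(var)


rng = random.Random(0)
TRUE_P, NOISE_STD = 2.0, 0.1                 # hypothetical target domain
traj, s = [], 0.0
for _ in range(50):
    a = rng.uniform(-1.0, 1.0)
    s_next = s + TRUE_P * a + rng.gauss(0.0, NOISE_STD)
    traj.append((s, a, s_next))
    s = s_next

samples = [rng.uniform(0.0, 4.0) for _ in range(500)]      # uniform sampling distribution
post_mean, post_std = posterior_from_samples(
    traj, samples,
    prior_logpdf=lambda p: gaussian_logpdf(p, 1.0, 1.0),  # prior centered away from the truth
    sampler_logpdf=lambda p: math.log(1.0 / 4.0),
    noise_std=NOISE_STD,
)
```

The fitted mean and standard deviation play the role of the updated distribution parameters for the next round of robust policy search.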

## 4 Experiments

We evaluated the proposed EPOpt-$\epsilon$ algorithm on the 2D hopper [9] and half-cheetah [37] benchmarks using the MuJoCo physics simulator [34]. The policy is represented by a neural network with two hidden layers, each with 64 units and a *tanh* non-linearity, and the final output layer is made of linear units. Normally distributed independent random variables are added to the output of this neural network, and we also learn the standard deviation of their distributions. Our experiments are aimed at answering the following questions:

1. How does the performance of standard policy search methods (like TRPO) degrade in the presence of systematic physical differences between the training and test domains, as might be the case when training in simulation and testing in the real world?
2. Does training on a distribution of models with EPOpt improve the performance of the policy when tested under various model discrepancies, and how much does ensemble training degrade overall performance (e.g. due to acquiring a more conservative strategy)?
3. How does the robustness of the policy to physical parameter discrepancies change when using the robust EPOpt-$\epsilon$ variant of our method?
4. Can EPOpt learn policies that are robust to unmodeled effects, that is, discrepancies in physical parameters between source and target domains that *do not* vary in the source domain ensemble?
5. When the initial model ensemble differs substantially from the target domain, can the ensemble be adapted efficiently, and how much data from the target domain is required for this?

In all the comparisons, *performance* refers to the average undiscounted return per trajectory or episode (we consider finite horizon episodic problems). In addition to this measure, we also use the 10th percentile of the return distribution as a proxy for the worst-case return.
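Both evaluation measures are simple statistics of per-episode returns. The sketch below shows one way to compute them; the nearest-rank percentile convention and the made-up reward data are assumptions, since the text does not specify how the percentile is interpolated.

```python
import math


def average_return(episode_rewards):
    """Mean undiscounted return over a list of per-episode reward sequences."""
    returns = [sum(rewards) for rewards in episode_rewards]
    return sum(returns) / len(returns)


def percentile_return(episode_rewards, q=10):
    """q-th percentile of undiscounted returns, nearest-rank convention."""
    returns = sorted(sum(rewards) for rewards in episode_rewards)
    rank = max(1, math.ceil(q * len(returns) / 100))
    return returns[rank - 1]


# Twenty made-up episodes whose returns are 1, 2, ..., 20.
episodes = [[float(k)] for k in range(1, 21)]
mean_perf = average_return(episodes)
worst_proxy = percentile_return(episodes, q=10)
```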

### 4.1 Comparison to Standard Policy Search

In Figure ?, we evaluate the performance of standard TRPO and EPOpt on the hopper task, in the presence of a simple parametric discrepancy in the physics of the system between the training (source) and test (target) domains. The plots show the performance of various policies on test domains with different torso mass. The first three plots show policies that are each trained on a single torso mass in the source domain, while the last plot illustrates the performance of EPOpt, which is trained on a Gaussian mass distribution. The results show that no single torso mass value produces a policy that is successful in all target domains. However, the EPOpt policy succeeds almost uniformly for all tested mass values. Furthermore, the results show that there is almost no degradation in the performance of EPOpt for any mass setting, suggesting that the EPOpt policy does not suffer substantially from adopting a more robust strategy.

### 4.2 Analysis of Robustness

Next, we analyze the robustness of policies trained using EPOpt on the hopper domain. For this analysis, we construct a source distribution which varies four different physical parameters: torso mass, ground friction, foot joint damping, and joint inertia (armature). This distribution is presented in Table ?. Using this source distribution, we compare three methods: (1) standard policy search (TRPO) trained on a single model corresponding to the mean parameters in Table ?; (2) EPOpt($\epsilon = 1$) trained on the source distribution; (3) EPOpt($\epsilon = 0.1$), i.e. the adversarially trained policy, again trained on the previously described source distribution. The aim of the comparison is to study direct-transfer performance, similar to the robustness evaluations common in robust controller design [39]. Hence, we learn a policy using each of the methods, and then test policies on different model instances (i.e. different combinations of physical parameters) without any adaptation. The results of this comparison are summarized in Figure 1, where we present the performance of the policy for testing conditions corresponding to different torso mass and friction values, which we found to have the most pronounced impact on performance. The results indicate that EPOpt($\epsilon = 0.1$) produces highly robust policies. A similar analysis for the 10th percentile of the return distribution (a softer version of worst-case performance), the half-cheetah task, and different $\epsilon$ settings is presented in the appendix.

### 4.3 Robustness to Unmodeled Effects

To analyze the robustness to unmodeled effects, our next experiment considers the setting where the source domain distribution is obtained by varying friction, damping, and armature as in Table ?, but does not consider a distribution over torso mass. Specifically, all models in the source domain distribution have the same torso mass (value of 6), but we will evaluate the policy trained on this distribution on target domains where the torso mass is different. Figure ? indicates that the EPOpt policy is robust to a broad range of torso masses even when its variation is not considered. However, as expected, this policy is not as robust as the case when mass is also modeled as part of the source domain distribution.

### 4.4 Model Adaptation

The preceding experiments show that EPOpt can find robust policies, but the source distribution in these experiments was chosen to be broad enough such that the target domain is not too far from high-density regions of the distribution. However, for real-world problems, we might not have the domain knowledge to identify a good source distribution in advance. In such settings, model (source) adaptation allows us to change the parameters of the source distribution using data gathered from the target domain. Additionally, model adaptation is helpful when the parameters of the target domain could change over time, for example due to wear and tear in a physical system. To illustrate model adaptation, we performed an experiment where the target domain was very far from the high density regions of the initial source distribution, as depicted in Figure ?(a). In this experiment, the source distribution varies the torso mass and ground friction. We observe that progressively, the source distribution becomes a better approximation of the target domain and consequently the performance improves. In this case, since we followed a sampling based approach, we used a uniform sampling distribution, and weighted each sample with the importance weight as described in Section 3.2. Eventually, after 10 iterations, the source domain distribution is able to accurately match the target domain. Figure ?(b) depicts the learning curve, and we see that a robust policy with return more than 2500, which roughly corresponds to a situation where the hopper is able to move forward without falling down for the duration of the episode, can be discovered with just 5 trajectories from the target domain. Subsequently, the policy improves near monotonically, and EPOpt finds a good policy with just 11 episodes worth of data from the target domain. 
In contrast, to achieve the same level of performance on the target domain, completely model-free methods like TRPO would require more than trajectories when the neural network parameters are initialized randomly.

## 5 Related Work

Robust control is a branch of control theory which formally studies development of robust policies [39]. However, typically no distribution over source or target tasks is assumed, and a worst case analysis is performed. Most results from this field have been concentrated around linear systems or finite MDPs, which often cannot adequately model complexities of real-world tasks. The set-up of model-based Bayesian RL maintains a belief over models for decision making under uncertainty [35]. In Bayesian RL, through interaction with the target domain, the uncertainty is reduced to find the correct or closest model. Application of this idea in its full general form is difficult, and requires either restrictive assumptions like finite MDPs [25], gaussian dynamics [26], or task specific innovations. Previous methods have also suggested treating uncertain model parameters as unobserved state variables in a continuous POMDP framework, and solving the POMDP to get optimal exploration-exploitation trade-off [8]. While this approach is general, and allows automatic learning of epistemic actions, extending such methods to large continuous control tasks like those considered in this paper is difficult.

Risk sensitive RL methods [6] have been proposed to act as a bridge between robust control and Bayesian RL. These approaches allow for using subjective model belief priors, prevent overly conservative policies, and enjoy some strong guarantees typically associated with robust control. However, their application in high dimensional continuous control tasks has not been sufficiently explored. We refer readers to [10] for a survey of related risk sensitive RL methods in the context of robustness and safety.

Standard model-based control methods typically operate by finding a maximum-likelihood estimate of the target model [18], followed by policy optimization. Use of model ensembles to produce robust controllers was explored recently in robotics. [20] use a trajectory optimization approach and an ensemble with a small finite set of models; whereas we follow a sampling based direct policy search approach over a continuous distribution of uncertain parameters, and also show domain adaptation. Sampling based approaches can be applied to complex models and discrete MDPs which cannot be planned through easily. Similarly, [36] use an ensemble of models, but their goal is to optimize for average case performance as opposed to transferring to a target MDP. [36] use a hand engineered policy class whose parameters are optimized with CMA-ES. EPOpt, on the other hand, can optimize expressive neural network policies directly. In addition, we show model adaptation, effectiveness of the sub-sampling step ($\epsilon < 1$ case), and robustness to unmodeled effects, all of which are important for transferring to a target MDP.

Learning of parametrized skills [4] is also concerned with finding policies for a distribution of parametrized tasks. However, this is primarily geared towards situations where task parameters are revealed during test time. Our work is motivated by situations where target task parameters (e.g. friction) are unknown. A number of methods have also been suggested to reduce sample complexity when provided with either a baseline policy [33], expert demonstration [15], or approximate simulator [30]. These are complementary to our work, in the sense that our policy, which has good direct-transfer performance, can be used to sample from the target domain and other off-policy methods could be explored for policy improvement.

## 6 Conclusions and Future Work

In this paper, we presented the EPOpt-$\epsilon$ algorithm for training robust policies on ensembles of source domains. Our method provides for training of robust policies, and supports an adversarial training regime designed to provide good direct-transfer performance. We also describe how our approach can be combined with Bayesian model adaptation to adapt the source domain ensemble to a target domain using a small amount of target domain experience. Our experimental results demonstrate that the ensemble approach provides for highly robust and generalizable policies in fairly complex simulated robotic tasks. Our experiments also demonstrate that Bayesian model adaptation can produce distributions over models that lead to better policies on the target domain than more standard maximum likelihood estimation, particularly in the presence of unmodeled effects.

Although our method exhibits good generalization performance, the adaptation algorithm we use currently relies on sampling the parameter space, which becomes computationally intensive as the number of variable physical parameters increases. We observed that (adaptive) sampling from the prior leads to fast and reliable adaptation if the true model does not have very low probability in the prior. However, when this assumption breaks, we require a different sampling distribution which could produce samples from all regions of the parameter space. This is a general drawback of Bayesian adaptation methods. In future work, we plan to explore alternative sampling and parameterization schemes, including non-parametric distributions. An eventual end-goal would be to replace the physics simulator entirely with learned Bayesian neural network models, which could be adapted with limited data from the physical system. These models could be pre-trained using physics based simulators like MuJoCo to get a practical initialization of neural network parameters. Such representations are likely useful when dealing with high dimensional inputs like simulated vision from rendered images or tasks with complex dynamics like deformable bodies, which are needed to train highly generalizable policies that can successfully transfer to physical robots acting in the real world.

## Acknowledgments

The authors would like to thank Emo Todorov, Sham Kakade, and students of Emo Todorov’s research group for insightful comments about the work. The authors would also like to thank Emo Todorov for the MuJoCo simulator. Aravind Rajeswaran and Balaraman Ravindran acknowledge financial support from ILDS, IIT Madras.

## A Appendix

### A.1 Description of simulated robotic tasks considered in this work

The hopper task is to make a 2D planar hopper with three joints and 4 body parts hop forward as fast as possible [9]. This problem has a 12 dimensional state space and a 3 dimensional action space that corresponds to torques at the joints. We construct the source domain by considering a distribution over 4 parameters: torso mass, ground friction, armature (inertia), and damping of foot.

The half-cheetah task [37] requires us to make a 2D cheetah with two legs run forward as fast as possible. The simulated robot has 8 body links with an 18 dimensional state space and a 6 dimensional action space that corresponds to joint torques. Again, we construct the source domain using a distribution over the following parameters: torso and head mass, ground friction, damping, and armature (inertia) of foot joints.

A video demonstration of the trained policies on these tasks can be viewed here: Supplementary video (`https://youtu.be/w1YJ9vwaoto`)

**Reward functions:** For both tasks, we used the standard reward functions implemented with OpenAI gym [3], with minor modifications. The reward structure for hopper task is:

where are the states comprising of joint positions and velocities; are the actions (controls); and is the forward velocity. is a bonus for being alive (). The episode terminates when or when where is the forward pitch of the body.

For the cheetah task, we use the reward function:

the alive bonus is if head of cheetah is above (relative to torso) and similarly episode terminates if the alive condition is violated.

Our implementation of the algorithms and environments is publicly available in this repository to facilitate reproduction of results: `https://github.com/aravindr93/robustRL`

### A.2 Hyperparameters

1. Neural network architecture: We used a neural network with two hidden layers, each with 64 units and *tanh* non-linearity. The policy updates are implemented using TRPO.
2. Trust region size in TRPO: The maximum KL divergence between successive policy updates is constrained to be
3. Number and length of trajectory rollouts: In each iteration, we sample $N = 240$ models from the ensemble, and one rollout is performed on each such model. This was implemented in parallel on multiple (6) CPUs. Each trajectory is of length – same as the standard implementations of these tasks in gym and rllab.

The results in Fig ? and Figure 1 were generated after 150 and 200 iterations of TRPO respectively, with each iteration consisting of 240 trajectories as specified in (3) above.
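For readers who want a concrete picture of the policy parametrization in item (1), here is a minimal standard-library sketch of a Gaussian policy whose mean is a two-hidden-layer *tanh* network with a linear output. The class name, the weight initialization, and the choice to store a state-independent log standard deviation are assumptions made for illustration; training code is omitted.

```python
import math
import random


def init_layer(n_in, n_out, rng, scale=0.1):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, x):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


class GaussianMLPPolicy:
    """Mean: two tanh hidden layers and a linear output. Noise: state-independent std."""

    def __init__(self, obs_dim, act_dim, hidden=64, seed=0):
        self.rng = random.Random(seed)
        self.l1 = init_layer(obs_dim, hidden, self.rng)
        self.l2 = init_layer(hidden, hidden, self.rng)
        self.out = init_layer(hidden, act_dim, self.rng)
        self.log_std = [0.0] * act_dim  # would be trained alongside the weights

    def mean(self, obs):
        h = [math.tanh(v) for v in affine(self.l1, obs)]
        h = [math.tanh(v) for v in affine(self.l2, h)]
        return affine(self.out, h)

    def act(self, obs):
        # Independent Gaussian noise is added to each output of the network.
        return [m + math.exp(ls) * self.rng.gauss(0.0, 1.0)
                for m, ls in zip(self.mean(obs), self.log_std)]


policy = GaussianMLPPolicy(obs_dim=12, act_dim=3)   # hopper-sized inputs and outputs
action = policy.act([0.0] * 12)
```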

### A.3 Worst-case analysis for hopper task

Figure 1 illustrates the performance of the three considered policies, viz. TRPO on mean parameters, EPOpt($\epsilon = 1$), and EPOpt($\epsilon = 0.1$). We similarly analyze the 10th percentile of the return distribution as a proxy for worst-case analysis, which is important for a robust control policy (here, the distribution of returns for a given model instance is due to variations in initial conditions). The corresponding results are presented below:

### A.4 Robustness analysis for half-cheetah task

### A.5 Different settings for $\epsilon$

Here, we analyze how different settings for $\epsilon$ influence the robustness of learned policies. The policies in this section have been trained for 200 iterations with 240 trajectory samples per iteration. Similar to the description in Section 3.1, the first 100 iterations use $\epsilon = 1$, and the final 100 iterations use the desired $\epsilon$. The source distribution is described in Table 1. We test the performance on a grid over the model parameters. Our results, summarized in Table ?, indicate that decreasing $\epsilon$ decreases the variance in performance, along with a small decrease in average performance, and hence enhances robustness.

| $\epsilon$ | Mean | Std | 5th pct. | 10th pct. | 25th pct. | 50th pct. | 75th pct. | 90th pct. |
|---|---|---|---|---|---|---|---|---|
| 0.05 | 2889 | 502 | 1662 | 2633 | 2841 | 2939 | 2966 | 3083 |
| 0.1 | 3063 | 579 | 1618 | 2848 | 3223 | 3286 | 3336 | 3396 |
| 0.2 | 3097 | 665 | 1527 | 1833 | 3259 | 3362 | 3423 | 3483 |
| 0.3 | 3121 | 706 | 1461 | 1635 | 3251 | 3395 | 3477 | 3513 |
| 0.4 | 3126 | 869 | 1013 | 1241 | 3114 | 3412 | 3504 | 3546 |
| 0.5 | 3122 | 1009 | 984 | 1196 | 1969 | 3430 | 3481 | 3567 |
| 0.75 | 3133 | 952 | 1005 | 1516 | 2187 | 3363 | 3486 | 3548 |
| 1.0 | 3224 | 1060 | 1198 | 1354 | 1928 | 3461 | 3557 | 3604 |
| Max-Lik | 1710 | 1140 | 352 | 414 | 646 | 1323 | 3088 | 3272 |

### A.6 Importance of baseline for BatchPolOpt

As described in Section 3.1, it is important to use a good baseline estimate for the value function for the batch policy optimization step. When optimizing for the expected return, we can interpret the baseline as a variance reduction technique. Intuitively, policy gradient methods adjust the parameters of the policy to improve the probability of trajectories in proportion to their performance. By using a baseline for the value function, we make updates that increase the probability of trajectories that perform better than average and decrease the probability of those that perform worse. In practice, this variance reduction is essential for getting policy gradients to work. For the CVaR case, [31] showed that without using a baseline, the policy gradient is biased. To study the importance of the baseline, we first consider the case where we do not employ the adversarial sub-sampling step, and fix $\epsilon = 1$. We use a linear baseline with a time-varying feature vector as described in Section 3.1. Figure ?(a) depicts the learning curve for the source distribution in Table ?. The results indicate that use of a baseline is important to make policy gradients work well in practice.
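A linear baseline of the kind mentioned above can be fit by least squares on discounted returns-to-go. The sketch below is a simplified illustration rather than the exact baseline used in the experiments: the feature vector contains only a constant and two powers of normalized time (an assumption; the features used in [7] are richer), and the small ridge term and the solver are implementation choices made here to keep the example dependency-free.

```python
def time_features(t, horizon):
    """Time-varying feature vector [1, t/T, (t/T)^2]; a simplified, assumed choice."""
    x = t / horizon
    return [1.0, x, x * x]


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_baseline(trajectories, gamma, ridge=1e-8):
    """Least-squares fit of discounted returns-to-go on the time features."""
    feats, targets = [], []
    for rewards in trajectories:
        horizon, togo = len(rewards), 0.0
        rtg = [0.0] * horizon
        for t in reversed(range(horizon)):
            togo = rewards[t] + gamma * togo
            rtg[t] = togo
        feats += [time_features(t, horizon) for t in range(horizon)]
        targets += rtg
    d = len(feats[0])
    ata = [[sum(f[i] * f[j] for f in feats) + (ridge if i == j else 0.0) for j in range(d)]
           for i in range(d)]
    aty = [sum(f[i] * y for f, y in zip(feats, targets)) for i in range(d)]
    return solve_linear_system(ata, aty)


def baseline_value(weights, t, horizon):
    return sum(w * f for w, f in zip(weights, time_features(t, horizon)))


# Two identical made-up trajectories with unit rewards: the return-to-go at t is 10 - t.
weights = fit_baseline([[1.0] * 10, [1.0] * 10], gamma=1.0)
advantage_t3 = 7.0 - baseline_value(weights, 3, 10)  # empirical return-to-go minus baseline
```

In the sub-sampled setting, `fit_baseline` would be called only on the trajectories retained after the percentile cut, consistent with the description in Section 3.1.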

Next, we turn to the case of $\epsilon < 1$. As mentioned in Section 3.1, setting a low $\epsilon$ from the start leads to unstable learning. The adversarial nature encourages penalizing poor trajectories more, which constrains the initial exploration needed to find promising trajectories. Thus, we “pre-train” using $\epsilon = 1$ for some iterations before switching to the desired $\epsilon$ setting. From Figure ?(a), it is clear that pre-training without a baseline is unlikely to help, since the performance is poor. Thus, we use the following setup for comparison: for 100 iterations, EPOpt with $\epsilon = 1$ is used with the baseline. Subsequently, we switch to EPOpt with the desired $\epsilon$ and run for another 100 iterations, totaling 200 iterations. The results of this experiment are depicted in Figure ?(b). This result indicates that the use of a baseline is crucial for the CVaR case, without which the performance degrades very quickly. We repeated the experiment with 100 iterations of pre-training with and without baseline, and observed the same effect. These empirical results reinforce the theoretical findings of [31].
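The pre-training schedule and the adversarial sub-sampling it controls can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the returns are made up, and the sub-sampling keeps the lowest `ceil(epsilon * N)` trajectories by return, which approximates selecting the trajectories at or below the $\epsilon$-percentile of returns.

```python
import math


def epsilon_for_iteration(iteration, target_epsilon, pretrain_iters=100):
    """Use epsilon = 1 (no sub-sampling) during pre-training, then the target."""
    return 1.0 if iteration < pretrain_iters else target_epsilon


def worst_fraction(returns, epsilon):
    """Indices of the trajectories in the lowest epsilon-fraction by return."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must be in (0, 1]")
    # Small tolerance guards against floating-point overshoot before ceil.
    k = max(1, math.ceil(epsilon * len(returns) - 1e-9))
    order = sorted(range(len(returns)), key=lambda i: returns[i])
    return sorted(order[:k])


# Hypothetical returns of six sampled trajectories.
returns = [3100.0, 1500.0, 2900.0, 3300.0, 1200.0, 3000.0]

for it in (0, 99, 100, 199):
    eps = epsilon_for_iteration(it, target_epsilon=0.5)
    kept = worst_fraction(returns, eps)
    print(f"iteration {it}: epsilon={eps}, trajectories used={kept}")
```

With $\epsilon = 1$ every trajectory is passed to the policy update, so the early iterations optimize the average return; after the switch, only the poorly performing trajectories drive the update.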

[Figure ?: panels (a) and (b)]

### A.7 Alternate Policy Gradient Subroutines for BatchPolOpt

As emphasized previously, EPOpt is a generic policy-gradient-based meta-algorithm for finding robust policies. The BatchPolOpt step (line 9, Algorithm 1) calls one gradient step of a policy gradient method, the choice of which is largely orthogonal to the main contributions of this paper. For the reported results, we have used TRPO as the policy gradient method. Here, we compare these results with those obtained using the classic REINFORCE algorithm. For this comparison, we use the same value function baseline parametrization for both TRPO and REINFORCE. Figure 2 depicts the learning curves for the two policy gradient methods. We observe that performance with TRPO is significantly better. When optimizing over probability distributions, the natural gradient can navigate the warped parameter space better than the “vanilla” gradient. This observation is consistent with the findings of [12], [28], and [7].
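The remark about the warped parameter space can be made concrete with a one-parameter example. The sketch below compares the vanilla and natural gradients of the expected reward for a two-action policy with $P(a = 1) = \sigma(\theta)$. The toy problem, the function name, and the reward values are assumptions for illustration; it shows only the Fisher-information rescaling behind the natural gradient and is not an implementation of TRPO or of the experiments reported here.

```python
import math


def gradients(theta, r0, r1):
    """Vanilla and natural gradient of J(theta) = (1 - p) * r0 + p * r1,
    where p = sigmoid(theta) is the probability of action 1."""
    p = 1.0 / (1.0 + math.exp(-theta))
    vanilla = p * (1.0 - p) * (r1 - r0)  # dJ/dtheta
    fisher = p * (1.0 - p)               # E[(d/dtheta log pi(a))^2]
    natural = vanilla / fisher           # Fisher-preconditioned gradient
    return vanilla, natural


# Hypothetical rewards: action 1 is better by one unit.
for theta in (0.0, -3.0, -6.0):
    vanilla, natural = gradients(theta, r0=0.0, r1=1.0)
    print(f"theta={theta:+.1f}: vanilla={vanilla:.5f}, natural={natural:.5f}")
```

When the policy is nearly deterministic in favor of the worse action, the vanilla gradient almost vanishes because of the sigmoid parametrization, whereas the natural gradient still reflects the full reward difference.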

### Footnotes

- Supplementary video: `https://youtu.be/w1YJ9vwaoto`

### References

1. Pieter Abbeel, Morgan Quigley, and Andrew Y. Ng. **Using inaccurate models in reinforcement learning.** In *ICML*, 2006.
2. Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. **A survey of robot learning from demonstration.** *Robotics and Autonomous Systems*.
3. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. **OpenAI Gym.** 2016.
4. Bruno Castro da Silva, George Konidaris, and Andrew G. Barto. **Learning parameterized skills.** In *ICML*, 2012.
5. Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. **A survey on policy search for robotics.** *Foundations and Trends® in Robotics*.
6. Erick Delage and Shie Mannor. **Percentile optimization for Markov decision processes with parameter uncertainty.** *Operations Research*.
7. Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. **Benchmarking deep reinforcement learning for continuous control.** In *ICML*, 2016.
8. Michael O. Duff. **Design for an optimal probe.** In *ICML*, 2003.
9. Tom Erez, Yuval Tassa, and Emanuel Todorov. **Infinite-horizon model predictive control for periodic tasks with contacts.** In *Proceedings of Robotics: Science and Systems*, 2011.
10. Javier García and Fernando Fernández. **A comprehensive survey on safe reinforcement learning.** *Journal of Machine Learning Research*.
11. Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. **Bayesian reinforcement learning: A survey.** *Foundations and Trends® in Machine Learning*.
12. Sham Kakade. **A natural policy gradient.** In *NIPS*, 2001.
13. Sham Kakade. *On the Sample Complexity of Reinforcement Learning*.
14. Sham Kakade and John Langford. **Approximately optimal approximate reinforcement learning.** In *ICML*, 2002.
15. Sergey Levine and Vladlen Koltun. **Guided policy search.** In *ICML*, 2013.
16. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. **Continuous control with deep reinforcement learning.** *ArXiv e-prints*.
17. Shiau Hong Lim, Huan Xu, and Shie Mannor. **Reinforcement learning in robust Markov decision processes.** In *NIPS*, 2013.
18. Lennart Ljung. *System Identification*, pp. 163–173.
19. Volodymyr Mnih et al. **Human-level control through deep reinforcement learning.** *Nature*.
20. I. Mordatch, K. Lowrey, and E. Todorov. **Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids.** In *IROS*, 2015a.
21. Igor Mordatch, Kendall Lowrey, Galen Andrew, Zoran Popovic, and Emanuel V. Todorov. **Interactive control of diverse complex characters with neural networks.** In *NIPS*, 2015b.
22. Arnab Nilim and Laurent El Ghaoui. **Robust control of Markov decision processes with uncertain transition matrices.** *Operations Research*.
23. Xue Bin Peng, Glen Berseth, and Michiel van de Panne. **Terrain-adaptive locomotion skills using deep reinforcement learning.** *ACM Transactions on Graphics (Proc. SIGGRAPH 2016)*.
24. Josep M. Porta, Nikos A. Vlassis, Matthijs T. J. Spaan, and Pascal Poupart. **Point-based value iteration for continuous POMDPs.** *Journal of Machine Learning Research*.
25. Pascal Poupart, Nikos A. Vlassis, Jesse Hoey, and Kevin Regan. **An analytic solution to discrete Bayesian reinforcement learning.** In *ICML*, 2006.
26. S. Ross, B. Chaib-draa, and J. Pineau. **Bayesian reinforcement learning in continuous POMDPs with application to robot navigation.** In *ICRA*, 2008.
27. Stephane Ross and Drew Bagnell. **Agnostic system identification for model-based reinforcement learning.** In *ICML*, 2012.
28. John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. **Trust region policy optimization.** In *ICML*, 2015.
29. David Silver et al. **Mastering the game of Go with deep neural networks and tree search.** *Nature*.
30. Aviv Tamar, Dotan Di Castro, and Ron Meir. **Integrating a partial model into model free reinforcement learning.** *Journal of Machine Learning Research*.
31. Aviv Tamar, Yonatan Glassner, and Shie Mannor. **Optimizing the CVaR via sampling.** In *AAAI Conference on Artificial Intelligence*, 2015.
32. Matthew E. Taylor and Peter Stone. **Transfer learning for reinforcement learning domains: A survey.** *Journal of Machine Learning Research*.
33. Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. **High-confidence off-policy evaluation.** In *AAAI Conference on Artificial Intelligence*, 2015.
34. E. Todorov, T. Erez, and Y. Tassa. **MuJoCo: A physics engine for model-based control.** In *2012 IEEE/RSJ International Conference on Intelligent Robots and Systems*, pp. 5026–5033, Oct 2012.
35. Nikos Vlassis, Mohammad Ghavamzadeh, Shie Mannor, and Pascal Poupart. *Bayesian Reinforcement Learning*, pp. 359–386.
36. Jack M. Wang, David J. Fleet, and Aaron Hertzmann. **Optimizing walking controllers for uncertain inputs and environments.** *ACM Trans. Graph.*
37. Pawel Wawrzynski. **Real-time reinforcement learning by sequential actor-critics and experience replay.** *Neural Networks*.
38. Ronald J. Williams. **Simple statistical gradient-following algorithms for connectionist reinforcement learning.** *Machine Learning*.
39. Kemin Zhou, John C. Doyle, and Keith Glover. *Robust and Optimal Control*.