# Contextual Policy Optimisation

###### Abstract

Policy gradient methods have been successfully applied to a variety of reinforcement learning tasks. However, while learning in a simulator, these methods do not utilise the opportunity to improve learning by adjusting certain environment variables: unobservable state features that are randomly determined by the environment in a physical setting, but that are controllable in a simulator. This can lead to slow learning, or convergence to highly suboptimal policies. In this paper, we present contextual policy optimisation (CPO). The central idea is to use Bayesian optimisation to actively select the distribution of the environment variable that maximises the improvement generated by each iteration of the policy gradient method. To make this Bayesian optimisation practical, we contribute two easy-to-compute low-dimensional fingerprints of the current policy. We apply CPO to a number of continuous control tasks of varying difficulty and show that CPO can efficiently learn policies that are robust to significant rare events, which are unlikely to be observable under random sampling but are key to learning good policies.

## 1 Introduction

Policy gradient methods have demonstrated remarkable success in learning policies for various continuous control tasks (Lillicrap et al., 2016; Mordatch et al., 2015; Schulman et al., 2016). However, the expense of running physical trials, coupled with the high sample complexity of these methods, poses significant challenges in directly applying them to a physical setting, e.g., to learn a locomotion policy for a robot. Another problem is evaluating the robustness of a learned policy; it is difficult to ensure that the policy performs as expected, as it is usually infeasible to test it across all possible settings. Fortunately, policies can often be learned and tested in a simulator that exposes key environment variables – state features that are unobservable to the agent and randomly determined by the environment in a physical setting, but that are controllable in the simulator.

A naïve application of a policy gradient method updates a policy at each iteration by using a batch of trajectories sampled from the original distribution over environment variables irrespective of the context, i.e., the current policy or the training iteration. Thus, it does not explicitly take into account how environment variables affect learning in different contexts. Furthermore, this approach is not robust to significant rare events (SREs), i.e., it fails any time there are rare events that substantially affect expected performance. Avoiding these problems requires learning off environment (Frank et al., 2008; Ciosek and Whiteson, 2017; Paul et al., 2018): exploiting the ability to adjust environment variables in simulation in order to improve the efficiency and robustness of learning.

For example, consider a quadruped that has to navigate through an environment to reach a goal as quickly as possible. If high velocity policies increase the probability of damage to the quadruped’s actuators, then the optimal policy should balance the reward gained by faster locomotion against the cost of potential damage to the robot. The naïve approach is bound to fail in this setting as the low probability of observing damage, together with the high cost of such an event, yields extremely high variance gradient estimates. However, adjusting environment variables to trigger damage scenarios more often can enable fast learning of robust policies.

In this paper, we propose a new off environment approach called contextual policy optimisation (CPO) that aims to learn policies that are robust to rare events. At its core, CPO uses a policy gradient method as the policy optimiser. However, unlike the naïve approach, CPO explicitly models the effect of the environment variable on the policy updates, as a function of the context. Using Bayesian optimisation (BO), CPO actively selects the environment distribution at each iteration of the policy gradient method in order to maximise the improvement that one policy gradient update step generates. While this can yield biased gradient estimates, CPO implicitly optimises the bias-variance tradeoff in order to maximise its one-step improvement objective.

A key design challenge in CPO is how to represent the context, which includes the current policy, in cases where the policy is a large neural network with thousands of parameters. To this end, we propose two low-dimensional policy fingerprints to represent the context. The first approximates the stationary distribution over states induced by the policy, with a size equal to the dimensionality of the state space. The second approximates the policy’s marginal distribution over actions, with a size equal to the dimensionality of the action space.

We apply CPO to different continuous control tasks and show that it can outperform existing methods, including those for learning in environments with SREs. Our experiments also show that both fingerprints work equally well in practice, which implies that, for a given problem, the lower dimensional fingerprint can be chosen without sacrificing performance.

## 2 Problem Setting and Background

A Markov decision process (MDP) is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, s_0, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the set of actions, $\mathcal{P}$ the transition probabilities, $\mathcal{R}$ the reward function, $s_0$ the probability distribution over the initial state, and $\gamma$ the discount factor. We assume that the transition and reward functions depend on some environment variables $\theta$. At the beginning of each episode, the environment randomly samples $\theta$ from some (known) distribution $p(\theta)$. The agent’s goal is to learn a policy $\pi$ mapping states to actions that maximises the expected return $J(\pi) = \mathbb{E}_{\theta \sim p(\theta),\, \tau \sim \pi}\big[\sum_t \gamma^t r_t\big]$, where $\tau$ is a trajectory and $r_t$ is the reward at timestep $t$. With a slight abuse of notation, we use $\pi$ to denote both the policy and its parameters. We consider environments characterised by significant rare events (SREs), i.e., there exist some low probability values of $\theta$ that generate large magnitude returns (positive or negative), yielding a significant impact on $J(\pi)$.

We assume that learning is undertaken in a simulator, or under laboratory conditions where the environment variable $\theta$ can be actively set. This is only mildly restrictive in practice. For example, it is typically possible to expose hidden variables like the coefficient of friction in a simulator, or deliberately limit the power of an actuator to simulate it being damaged.

### 2.1 Policy Gradient Methods

Starting with some policy $\pi_n$ at iteration $n$, gradient based batch policy optimisation methods like REINFORCE (Williams, 1992), NPG (Kakade, 2001), and TRPO (Schulman et al., 2015) compute an estimate of the gradient $\nabla J(\pi_n)$ by sampling a batch of trajectories from the environment while following policy $\pi_n$, and then use this estimate to approximate gradient ascent in $J$, yielding an updated policy $\pi_{n+1}$. REINFORCE uses a fixed learning rate to update the policy; NPG uses the Fisher information matrix to scale the gradient, which makes the updates independent of the policy parameterisation; and TRPO adds a constraint on the KL divergence between consecutive policies.

A major problem for such methods is that the estimate of $\nabla J(\pi_n)$ can have high variance due to stochasticity in the policy and environment, yielding slow learning (Williams, 1992; Glynn, 1990; Peters and Schaal, 2006). In settings with SREs, this problem is compounded by the variance due to the environment variable $\theta$, which the environment samples for each trajectory in the batch. Furthermore, the SREs may not be observed during learning since the environment samples $\theta \sim p(\theta)$, which can lead to convergence to a highly suboptimal policy. Since these methods do not explicitly consider the effect of the environment variable on learning, we refer to them as naïve approaches.
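To make the point about unobserved SREs concrete, the following minimal sketch computes the chance that a batch drawn from the original environment distribution contains no rare event at all. The function name and the example numbers (a 2% event probability and a batch of 100 episodes) are illustrative assumptions, not values taken from the method itself.

```python
def prob_no_sre(p_sre: float, batch_size: int) -> float:
    """Probability that none of `batch_size` independent episodes hits the rare event.

    Assumes each episode triggers the event independently with probability p_sre.
    """
    if not 0.0 <= p_sre <= 1.0:
        raise ValueError("p_sre must be a probability")
    return (1.0 - p_sre) ** batch_size


# Illustrative numbers only: a 2% event and a batch of 100 episodes.
print(prob_no_sre(0.02, 100))  # about 0.13
```

Under these assumed numbers, roughly one batch in eight carries no information about the rare event, and the remaining batches contain only a handful of such episodes, which is consistent with the high variance described above.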

### 2.2 Bayesian Optimisation

A Gaussian process (GP) (Rasmussen and Williams, 2005) is a distribution over functions. It is fully specified by its mean $m(x)$ (often assumed to be 0 for convenience) and covariance function $k(x, x')$, which encode any prior belief about the function. A posterior distribution is generated by using observed values to update the belief about the function in a Bayesian way. The squared exponential kernel is a popular choice for the covariance function, and has the form $k(x, x') = \sigma_0^2 \exp\big(-\tfrac{1}{2}(x - x')^\top \Lambda^{-1} (x - x')\big)$, where $\Lambda$ is a diagonal matrix whose diagonal gives lengthscales corresponding to each dimension of $x$. By conditioning on the observed data, predictions for any new points $\tilde{x}$ can be computed analytically as a Gaussian $\mathcal{N}\big(f(\tilde{x});\, \mu(\tilde{x}),\, \sigma^2(\tilde{x})\big)$:

$$\mu(\tilde{x}) = k(\tilde{x}, X)\,\big(K + \sigma_{\text{noise}}^2 I\big)^{-1} f(X), \tag{1a}$$

$$\sigma^2(\tilde{x}) = k(\tilde{x}, \tilde{x}) - k(\tilde{x}, X)\,\big(K + \sigma_{\text{noise}}^2 I\big)^{-1} k(X, \tilde{x}), \tag{1b}$$

where $X$ is the vector of observed inputs, $f(X)$ is the corresponding function values, $K$ is the covariance matrix with elements $k(x_i, x_j)$, and $\sigma_{\text{noise}}^2$ is the observation noise variance. Probabilistic modelling of the predictions makes GPs well suited for optimising $f$ using BO. Given a set of observations, the next point for evaluation is chosen as the $x$ that maximises an acquisition function, which uses the posterior mean and variance to balance exploitation and exploration. This paper considers two acquisition functions: upper confidence bound (UCB) (Cox and John, 1992, 1997), and fast information-theoretic Bayesian optimisation (FITBO) (Ru et al., 2017).
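The GP posterior described above can be sketched in a few lines of NumPy. This is a minimal illustration for a zero-mean GP with a squared exponential kernel, not the implementation used in the experiments; the function names, the unit signal variance, the default noise level, and the convention that the kernel divides each input dimension by its lengthscale are all assumptions made for the example.

```python
import numpy as np


def sq_exp_kernel(a, b, lengthscales, signal_var=1.0):
    """Squared exponential kernel between rows of a (n, d) and b (m, d).

    Assumed convention: each dimension is divided by its lengthscale before
    the squared distance is taken.
    """
    a = np.asarray(a, dtype=float) / lengthscales
    b = np.asarray(b, dtype=float) / lengthscales
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating point round-off.
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0))


def gp_posterior(x_obs, f_obs, x_new, lengthscales, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at each row of x_new."""
    k_obs = sq_exp_kernel(x_obs, x_obs, lengthscales)
    k_cross = sq_exp_kernel(x_new, x_obs, lengthscales)
    k_new_diag = np.diag(sq_exp_kernel(x_new, x_new, lengthscales))

    # Cholesky factorisation of the noisy covariance matrix for stable solves.
    chol = np.linalg.cholesky(k_obs + noise_var * np.eye(len(k_obs)))
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, f_obs))

    mean = k_cross @ alpha
    v = np.linalg.solve(chol, k_cross.T)
    var = k_new_diag - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)


# Usage: three observations of a 1-D function, queried near and far from the data.
x_obs = np.array([[0.0], [1.0], [2.0]])
f_obs = np.sin(x_obs[:, 0])
mean, var = gp_posterior(x_obs, f_obs, np.array([[1.0], [100.0]]), np.array([1.0]))
```

At an observed input the posterior mean reproduces the observation and the variance collapses towards the noise level, while far from the data the prediction reverts to the prior mean and variance.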

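Given a dataset $\mathcal{D}_{1:l} = \{(x_i, f(x_i))\}_{i=1}^{l}$, UCB directly incorporates the prediction uncertainty by defining an upper bound: $\alpha_{\text{UCB}}(x \mid \mathcal{D}_{1:l}) = \mu(f(x) \mid \mathcal{D}_{1:l}) + \kappa\,\sigma(f(x) \mid \mathcal{D}_{1:l})$, where $\kappa$ controls the exploration-exploitation tradeoff. On the other hand, FITBO aims to reduce the uncertainty about the global optimum $f(x^*)$ by selecting the $x$ that minimises the entropy of the distribution $p(f(x^*) \mid x, \mathcal{D}_{1:l})$. The acquisition function is given by: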

$$\alpha_{\text{FITBO}}(x \mid \mathcal{D}_{1:l}) = H\big[p(f(x) \mid x, \mathcal{D}_{1:l})\big] - \mathbb{E}_{p(f(x^*) \mid \mathcal{D}_{1:l})}\Big[H\big[p(f(x) \mid f(x^*), x, \mathcal{D}_{1:l})\big]\Big],$$

which is intractable and requires approximations to compute efficiently.
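The UCB rule, by contrast, is cheap to evaluate once the posterior mean and standard deviation are available. A minimal sketch over a finite candidate set follows; the function names, the candidate-list interface, and the default value of `kappa` are assumptions for illustration.

```python
def ucb_scores(means, stds, kappa):
    """Upper confidence bound for each candidate: posterior mean + kappa * std."""
    return [m + kappa * s for m, s in zip(means, stds)]


def select_by_ucb(candidates, means, stds, kappa=2.0):
    """Return the candidate with the highest UCB score."""
    scores = ucb_scores(means, stds, kappa)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# With kappa = 0 the rule is purely exploitative; a larger kappa favours
# candidates whose value is still uncertain.
candidates = [0.0, 1.0, 2.0]
means = [1.0, 0.5, 0.0]
stds = [0.0, 0.1, 1.0]
greedy_choice = select_by_ucb(candidates, means, stds, kappa=0.0)
exploratory_choice = select_by_ucb(candidates, means, stds, kappa=2.0)
```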

BO mainly minimises simple regret: the acquisition function suggests the next point $x_l$ for evaluation at each timestep $l$, but then the algorithm suggests what it believes to be the optimal point $\hat{x}^*_l$, and the regret is defined as $f(x^*) - f(\hat{x}^*_l)$, where $x^*$ is the true optimum. This is different from a bandit setting where the cumulative regret is defined as $\sum_{l} \big(f(x^*) - f(x_l)\big)$. Krause and Ong (2011) showed that the UCB acquisition function is also a viable strategy to minimise cumulative regret in a contextual GP bandit setting, where selection of $x_l$ conditions on some observed context.
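The distinction between the two regret notions can be illustrated with a short sketch. The helper names and the example values are hypothetical; the definitions follow the standard ones, with simple regret measured only at the final recommendation and cumulative regret summed over every evaluated point.

```python
def simple_regret(f_opt, f_recommended):
    """Gap between the true optimum and the value at the recommended point."""
    return f_opt - f_recommended


def cumulative_regret(f_opt, f_evaluated):
    """Sum of the gaps between the true optimum and every evaluated point."""
    return sum(f_opt - f for f in f_evaluated)


# A run that explores poor points early and ends on a good one has low simple
# regret but still pays for every poor evaluation in its cumulative regret.
evaluated_values = [0.2, 0.5, 0.9]
final_gap = simple_regret(1.0, evaluated_values[-1])
total_gap = cumulative_regret(1.0, evaluated_values)
```

This matters for a method that evaluates each selected point as part of learning, because every evaluation contributes to the outcome rather than only the final recommendation.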

## 3 Contextual Policy Optimisation

To address the challenges posed by environments with SREs, we introduce contextual policy optimisation (CPO). The main idea is to actively select a distribution for $\theta$ conditioned on the context, i.e., the current policy, such that it helps the policy optimisation routine learn a policy that takes into account any SREs. Concretely, CPO executes the following steps at each iteration $n$. First, it selects $\psi_n$ by approximately solving an optimisation problem defined below. Second, it samples trajectories from the environment using the current policy $\pi_n$, where each trajectory uses a value for $\theta$ sampled from a distribution parameterised by $\psi_n$: $\theta \sim q_{\psi_n}(\theta)$. Third, these trajectories are fed to a policy optimisation routine PolOpt, e.g., a policy gradient algorithm, which uses them to compute an updated policy, $\pi_{n+1} = \text{PolOpt}(\pi_n, \psi_n)$. Fourth, new, independent trajectories are generated from the environment with $\pi_{n+1}$ and used to estimate $J(\pi_{n+1})$. Fifth, a new point $\big((\psi_n, \pi_n), J(\pi_{n+1})\big)$ is added to a dataset $\mathcal{D}_{1:n}$, which is input to a GP. The process then repeats, with BO using the GP to select the next $\psi$.

The key insight behind CPO is that at each iteration $n$, CPO should select the $\psi_n$ that it expects will maximise the performance of the next policy, $\pi_{n+1}$:

$$\psi_n \mid \pi_n = \operatorname*{argmax}_{\psi} J\big(\text{PolOpt}(\pi_n, \psi)\big) = \operatorname*{argmax}_{\psi} J(\pi_{n+1}). \tag{2}$$

In other words, CPO chooses the $\psi_n$ that it thinks will help PolOpt maximise the improvement to $\pi_n$ that can be made in a single policy update. By modelling the relationship between $\psi_n$, $\pi_n$, and $J(\pi_{n+1})$ with a GP, CPO can learn from experience how to select an appropriate $\psi$ for the current $\pi$. Modelling $J(\pi_{n+1})$ directly also bypasses the issue of modelling $\pi_{n+1}$ and its relationship to $J(\pi_{n+1})$, which is infeasible when $\pi_{n+1}$ is high dimensional. While PolOpt has inputs $(\pi_n, \psi)$, the optimisation is performed over $\psi$ only, while $\pi_n$ is fixed and forms the context, hence the name of our method. Figure 4 illustrates CPO, which is summarised in Algorithm 1. In the remainder of this section, we describe in more detail elements of CPO that are essential to make it work in practice.
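The five steps of an iteration can be sketched as a loop over user-supplied components. This is a structural sketch only, not Algorithm 1 itself: every callable name (`select_psi`, `sample_trajectories`, `pol_opt`, `estimate_return`, `fingerprint`) is a hypothetical placeholder, and the toy stand-ins in the usage example exist purely so that the loop runs.

```python
def cpo_loop(policy, n_iterations, select_psi, sample_trajectories,
             pol_opt, estimate_return, fingerprint):
    """Run n_iterations of the five steps; all callables are user-supplied."""
    dataset = []  # one ((psi_n, context_n), estimated return of next policy) per iteration
    for n in range(n_iterations):
        context = fingerprint(policy, n)            # low-dimensional stand-in for the policy
        psi = select_psi(dataset, context)          # step 1: BO proposes the distribution parameter
        batch = sample_trajectories(policy, psi)    # step 2: roll out with theta drawn under psi
        new_policy = pol_opt(policy, batch)         # step 3: one policy update
        j_new = estimate_return(new_policy)         # step 4: estimate the updated policy's return
        dataset.append(((psi, context), j_new))     # step 5: new training point for the GP
        policy = new_policy
    return policy, dataset


# Toy stand-ins (not the paper's components): the "policy" is a single float
# and the best achievable return is reached at policy == 3.0.
final_policy, data = cpo_loop(
    policy=0.0,
    n_iterations=3,
    select_psi=lambda dataset, context: 0.5,
    sample_trajectories=lambda policy, psi: [psi],
    pol_opt=lambda policy, batch: policy + 1.0,
    estimate_return=lambda policy: -abs(policy - 3.0),
    fingerprint=lambda policy, n: (n, policy),
)
```

Keeping the policy optimiser and the selection rule as interchangeable callables mirrors the description above, in which the optimisation is over the distribution parameter only and the policy enters purely as context.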

### 3.1 Selecting $\psi_n$

The optimisation problem in (2) is difficult for two reasons. First, solving it requires calling PolOpt, which is expensive in both computation and samples. Second, the observed $J(\pi_{n+1})$ can be noisy due to the inherent stochasticity in the policy and the environment.

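BO is particularly suited to such settings as it is sample efficient, gradient free, and can work with noisy observations. In this paper, we consider both the UCB and FITBO acquisition functions to select $\psi_n$ in (2), and compare their performance. Formally, we model the returns $J(\pi_{n+1})$ as a GP with two inputs $(\psi_n, \pi_n)$. Given a dataset $\mathcal{D}_{1:n-1} = \{((\psi_i, \pi_i), J(\pi_{i+1}))\}_{i=1}^{n-1}$, $\psi_n$ is selected by maximising the UCB or FITBO acquisition function:

$$\alpha_{\text{UCB}}(\psi \mid \pi_n) = \mu\big(J(\pi_{n+1}) \mid \psi, \pi_n\big) + \kappa\,\sigma\big(J(\pi_{n+1}) \mid \psi, \pi_n\big), \tag{3}$$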

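$$\alpha_{\text{FITBO}}(\psi \mid \pi_n) = H\big[p(J(\pi_{n+1}) \mid \psi, \pi_n)\big] - \mathbb{E}_{p(J^* \mid \pi_n)}\Big[H\big[p(J(\pi_{n+1}) \mid J^*, \psi, \pi_n)\big]\Big]. \tag{4}$$

Here $J^*$ denotes the maximum of $J(\pi_{n+1})$ over $\psi$, and we have dropped the dataset $\mathcal{D}_{1:n-1}$ from all the conditioning sets for ease of notation.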

Computing the gradient estimates using trajectories sampled from the environment with $\theta$ drawn from the selected distribution $q_{\psi_n}(\theta)$, rather than the original $p(\theta)$, introduces bias. While most importance sampling (IS) based methods (e.g., Frank et al. (2008); Ciosek and Whiteson (2017)) could correct for this bias, CPO does not explicitly do so. Instead, CPO lets BO implicitly optimise a bias-variance tradeoff by selecting $\psi_n$ to maximise the one-step improvement objective.

### 3.2 Estimating $J(\pi_n)$

Estimating $J(\pi_n)$ accurately in the presence of SREs can be challenging. A Monte Carlo estimate using samples of trajectories from the original environment requires a prohibitively large number of samples. One alternative would be to apply an IS correction to the trajectories generated from $q_{\psi_n}(\theta)$ for the policy optimisation routine. However, this is not feasible since it would require computing the IS weights of each trajectory, which depend on the unknown transition function. Furthermore, even if the transition function were known, there is no reason why $q_{\psi_n}(\theta)$ should yield a good IS distribution since it is selected with the objective of maximising $J(\pi_{n+1})$.

Instead, CPO applies exhaustive summation for discrete $\theta$ and numerical quadrature for continuous $\theta$ to estimate $J(\pi_n)$. That is, if the support of $\theta$ is discrete, we simply sample a trajectory from each environment defined by $\theta = \theta_l$ and estimate $J(\pi_n) = \sum_{l} p(\theta = \theta_l)\, R_{\theta_l}$, where $R_{\theta_l}$ is the return of the trajectory sampled with $\theta = \theta_l$. To reduce the variance due to stochasticity in the policy and the environment, we can sample multiple trajectories from each $\theta_l$. Since in practice $\theta$ is usually no more than two dimensional, for continuous $\theta$ we apply an adaptive Gauss-Kronrod quadrature rule to estimate $J(\pi_n)$.
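For the discrete case, exhaustive summation can be sketched as a probability-weighted sum of per-setting returns. The function and argument names are hypothetical, and `rollout` stands in for running one episode in the simulator with the environment variable fixed; the continuous case with adaptive quadrature is not covered by this sketch.

```python
def estimate_return_discrete(policy, thetas, probs, rollout, n_rollouts=1):
    """Estimate the expected return by summing over every discrete setting.

    thetas:  the possible values of the environment variable
    probs:   their probabilities under the original distribution
    rollout: callable (policy, theta) -> return of one sampled trajectory
    """
    if len(thetas) != len(probs):
        raise ValueError("thetas and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must sum to one")
    total = 0.0
    for theta, prob in zip(thetas, probs):
        # Averaging several rollouts reduces noise from the policy and dynamics.
        returns = [rollout(policy, theta) for _ in range(n_rollouts)]
        total += prob * sum(returns) / n_rollouts
    return total


# Toy example: a rare "damage" setting with a large negative return.
def toy_rollout(policy, theta):
    return -100.0 if theta == "damage" else 10.0


estimate = estimate_return_discrete(
    policy=None, thetas=["ok", "damage"], probs=[0.98, 0.02],
    rollout=toy_rollout, n_rollouts=5,
)
```

Because every setting is visited once per estimate, the rare setting always contributes with its exact probability, instead of appearing only in the occasional Monte Carlo sample.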

### 3.3 Policy Fingerprints

True global optimisation is limited by the curse of dimensionality to low-dimensional inputs, and BO has had only rare successes in problems with more than twenty dimensions (Wang et al., 2013). In CPO, many of the inputs to the GP are policy parameters: in practice, the policy is a neural network that may have thousands of parameters, far too many for a GP. Thus, we need to develop a policy fingerprint, i.e., a representation that is low dimensional enough to be treated as an input to the GP but expressive enough to distinguish the policy from others.

Foerster et al. (2017) showed that a surprisingly simple fingerprint, consisting only of the training iteration, suffices to stabilise multi-agent $Q$-learning. Using the training iteration alone as the fingerprint proves insufficient for CPO, as the GP fails to model the response surface and treats all observed $J(\pi)$ as noise. However, the principle still applies: a simplistic fingerprint that discards much information about the policy can still provide sufficient context for decision making, in this case to select $\psi_n$.

In this spirit, we propose two fingerprints. The first, the state fingerprint, augments the training iteration with an estimate of the stationary state distribution induced by the policy. In particular, we fit an anisotropic Gaussian to the set of states visited in the trajectories sampled while estimating $J(\pi_n)$ (see Section 3.2). The size of this fingerprint grows linearly with the dimensionality of the state space, instead of the number of parameters in the policy.

In many settings, the state space is high dimensional, but the action space is low dimensional. Therefore, our second fingerprint, the action fingerprint, is a Gaussian approximation of the marginal distribution over actions induced by the policy: $p(a) = \int \pi(a \mid s)\, p(s)\, \mathrm{d}s$, estimated from sampled trajectories as with the state fingerprint.

Of course, neither the stationary state distribution nor the marginal action distribution is likely to be Gaussian, and either could in fact be multimodal. Furthermore, the state distribution is estimated from samples used to estimate $J(\pi)$, and not from trajectories sampled under the original distribution $p(\theta)$. However, as our results show, these representations are nonetheless effective, as they do not need to accurately describe each policy, but instead just serve as low dimensional fingerprints on which CPO conditions.
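A fingerprint of this kind can be sketched as a per-dimension Gaussian fit to the visited states (or actions), prefixed by the training iteration. This is an illustration under stated assumptions: "anisotropic" is interpreted here as a diagonal covariance, and the function name and flat output layout are hypothetical.

```python
import numpy as np


def gaussian_fingerprint(samples, iteration):
    """Fit an independent Gaussian to each dimension of the samples.

    samples:   array of shape (n_samples, dim), e.g. visited states or actions
    iteration: current training iteration, prepended to the fingerprint
    Returns a flat vector [iteration, means..., variances...] of length 1 + 2 * dim.
    """
    samples = np.asarray(samples, dtype=float)
    if samples.ndim != 2:
        raise ValueError("samples must have shape (n_samples, dim)")
    mean = samples.mean(axis=0)
    var = samples.var(axis=0)
    return np.concatenate(([float(iteration)], mean, var))


# Usage: two visited 2-D states at training iteration 5.
fp = gaussian_fingerprint([[0.0, 0.0], [2.0, 4.0]], iteration=5)
```

The length of the vector depends only on the dimensionality of the space being summarised, so the same function serves for either fingerprint by passing states or actions.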

### 3.4 Covariance Function

Our choice of policy fingerprints means that one of the inputs to the GP is a probability distribution. Thus for our GP prior we use a product of three covariance functions, $k_\psi(\psi, \psi') \times k_n(n, n') \times k_\pi(\hat{\pi}, \hat{\pi}')$, where each of $k_\psi$, $k_n$ and $k_\pi$ is a squared exponential covariance function and $\hat{\pi}$ is the state or action fingerprint of $\pi$. Similar to Malkomes et al. (2016), we use the Hellinger distance to replace the Euclidean distance in $k_\pi$: this covariance remains positive semi-definite as the Hellinger distance is effectively a modified Euclidean distance.
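For Gaussian fingerprints with diagonal covariance, the squared Hellinger distance has a closed form, so a kernel of this type is straightforward to sketch. The function names and the single `lengthscale` parameter are assumptions for illustration; the closed form is the standard one for Gaussians, applied independently per dimension.

```python
import numpy as np


def hellinger_sq_diag_gauss(mean_p, var_p, mean_q, var_q):
    """Squared Hellinger distance between two diagonal-covariance Gaussians."""
    mean_p, var_p, mean_q, var_q = (
        np.asarray(v, dtype=float) for v in (mean_p, var_p, mean_q, var_q)
    )
    var_sum = var_p + var_q
    # Log of the Bhattacharyya coefficient, one term per dimension.
    log_bc = (
        0.5 * np.log(2.0 * np.sqrt(var_p * var_q) / var_sum)
        - 0.25 * (mean_p - mean_q) ** 2 / var_sum
    )
    return max(0.0, 1.0 - float(np.exp(np.sum(log_bc))))


def fingerprint_kernel(fp_a, fp_b, lengthscale=1.0):
    """Squared exponential kernel with Hellinger distance in place of Euclidean.

    Each fingerprint is a (mean, variance) pair of arrays.
    """
    h_sq = hellinger_sq_diag_gauss(fp_a[0], fp_a[1], fp_b[0], fp_b[1])
    return float(np.exp(-0.5 * h_sq / lengthscale**2))


# Usage: identical fingerprints give a kernel value of one.
same = fingerprint_kernel(([0.0, 0.0], [1.0, 1.0]), ([0.0, 0.0], [1.0, 1.0]))
shifted = hellinger_sq_diag_gauss([0.0], [1.0], [2.0], [1.0])
```

In a product kernel of the form described above, this term would be multiplied by ordinary squared exponential kernels over the distribution parameter and the training iteration.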

## 4 Related Work

Various methods have been proposed for learning in the presence of SREs. These are usually off environment and either based on learning a good IS distribution from which to sample the environment variable (Frank et al., 2008; Ciosek and Whiteson, 2017), or Bayesian active selection of the environment variable during learning (Paul et al., 2018).

Frank et al. (2008) propose a temporal difference based method that uses IS for efficiently evaluating policies whose expected value may be substantially affected by rare events. However, their method assumes prior knowledge of the SREs, such that they can directly alter the probability of such events during policy evaluation. By contrast, CPO does not require any such prior knowledge about SREs, or the environment variable settings that might trigger them. It only assumes that the original distribution of the environment variable is known, and that the environment variable is controllable during learning.

OFFER (Ciosek and Whiteson, 2017) is a policy gradient based algorithm that uses observed trials to gradually change the IS distribution over the environment variable. Like CPO, it makes no prior assumptions about SREs. However, at each iteration it updates the environment distribution with the objective of minimising the variance of the gradient estimate, which may not lead to the distribution that optimises the learning of the policy. A further disadvantage of the method is that it requires a full transition model of the environment to compute the IS weights. It can also lead to unstable IS estimates if the environment variable affects any transitions besides the initial state.

ALOQ (Paul et al., 2018) is a Bayesian optimisation and quadrature based method that models the return as a GP with the policy parameters and environment variable as inputs. At each iteration it actively selects the policy and then the environment variable in an alternating fashion and as such, performs the policy search in the parameter space. Being a BO based method, it does not make any assumption of the Markov property of the environment and is highly sample efficient. However, it can only be applied to settings with low dimensional policies. Furthermore, its computational cost scales cubically with the number of iterations, and is thus limited to settings where a good policy can be found within relatively few iterations. In contrast to this, CPO uses a policy gradient method to perform policy optimisation while the BO component generates trajectories, which when used by the policy optimiser are expected to lead to a larger improvement in the policy.

In the wider BO literature, Williams et al. (2000) suggested a method for settings where the objective is to optimise expensive integrands. However, their method does not specifically consider the impact of SREs and, as shown by Paul et al. (2018), is unsuitable for such settings. Toscano-Palmerin and Frazier (2018) suggest BQO, another BO based method for expensive integrands. Their method also does not explicitly consider SREs. Finally, both these methods suffer from all the disadvantages of BO based methods mentioned earlier.

Rajeswaran et al. (2017) propose a different, off-environment approach for cases where the simulator settings can have a significant impact on policy performance. Their algorithm, EPOpt($\epsilon$), seeks to learn robust policies by maximising the $\epsilon$-percentile conditional value at risk (CVaR) of the policy. First, it randomly samples a set of simulator settings; then trajectories are sampled for each of these settings. A policy optimisation routine (e.g., TRPO (Schulman et al., 2015)) is then used to update the policy based on only those trajectories with returns lower than the $\epsilon$ percentile in the batch. A fundamental difference to CPO is that it finds a risk-averse solution based on CVaR, while CPO finds a risk neutral policy. Also, while CPO actively changes the distribution for sampling the environment variable at each iteration, EPOpt samples them from the original distribution, and is thus unlikely to be suitable for settings with SREs, since it will not generate them often enough to learn an appropriate response. Finally, EPOpt discards all sampled trajectories in the batch with returns greater than the $\epsilon$ percentile for use by the policy optimisation routine, making it highly sample inefficient, especially for low values of $\epsilon$.
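The trajectory filtering step described for EPOpt can be sketched as a percentile cut on the batch returns. This is a simplified illustration of that step only, not the EPOpt implementation: the function name is hypothetical, `epsilon` is taken as a fraction in (0, 1], and a non-strict comparison is used so that the worst trajectory is always kept.

```python
import numpy as np


def epopt_filter(trajectories, returns, epsilon):
    """Keep only trajectories whose return is at or below the epsilon percentile."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    cutoff = np.percentile(returns, 100.0 * epsilon)
    return [t for t, r in zip(trajectories, returns) if r <= cutoff]


# Usage: with ten trajectories and epsilon = 0.2, eight are discarded.
names = [f"t{i}" for i in range(1, 11)]
batch_returns = [float(i) for i in range(1, 11)]
kept = epopt_filter(names, batch_returns, epsilon=0.2)
```

The example makes the sample inefficiency visible: the smaller the fraction, the larger the share of simulated trajectories that never reaches the policy optimiser.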

RARL, proposed by Pinto et al. (2017), also seeks to learn robust policies by training in a simulator where an adversary applies destabilising forces, with both the agent and the adversary being trained simultaneously. RARL requires significant prior knowledge in setting up the adversary to ensure that it strikes a balance between making the environment so difficult that the agent is unable to learn, and making it so easy that the policy learnt by the agent is not robust enough. Also, like EPOpt, it does not consider any settings with SREs.

By learning an optimal distribution for the environment variable conditioned on the policy fingerprint, CPO also has some parallels with meta-learning. Methods like MAML (Finn et al., 2017) and Reptile (Nichol et al., 2018) seek to find a good policy representation that can be adapted quickly to a specified task. Andrychowicz et al. (2016) and Chen et al. (2017) seek to optimise neural networks by learning an automatic update rule based on transferring knowledge from similar optimisation problems. To maximise the performance of a neural network across a set of discrete tasks, Graves et al. (2017) propose a method for automatically selecting a curriculum during learning. Their method treats the problem as a multi-armed bandit and uses the Exp3 algorithm (Auer et al., 2002) to find the optimal curriculum. Unlike these methods, which seek to quickly adapt to a new task after training on some related task, CPO seeks to maximise the expected return across a variety of tasks.

## 5 Experiments

To evaluate the empirical performance of CPO, we start by applying it to a simple problem: a modified version of the cliff walker task (Sutton and Barto, 1998), with one dimensional state and action spaces. We then move on to simulated robotics problems based on the MuJoCo simulator (Brockman et al., 2016) with much higher dimensionalities. These were modified to include SREs. We aim to answer two questions: (1) How do the different versions of CPO (UCB vs. FITBO acquisition functions, state (S) vs. action (A) fingerprints) compare with each other? (2) How does CPO compare to existing methods (Naïve, OFFER, EPOpt, ALOQ), and to ablated versions of CPO? We repeat all our experiments across 10 random starts, and present the median (solid line) and quartiles (shaded region) in the plots. We use TRPO as the policy optimisation method.

Due to the disadvantages of ALOQ mentioned in Section 4, we were able to apply it only on the cliff walker problem. The policy dimensionality and the total number of iterations for the simulated robotic tasks were far too high. Note also that, while we compare CPO to EPOpt, these methods optimise for different objectives.

### 5.1 Cliff Walker

We start with a toy problem: a modified version of the cliff walker problem where, instead of a gridworld, we consider an agent moving in a continuous state space; the agent starts randomly near the state 0, and at each timestep can take an action $a$. The environment then transitions the agent to a new location that depends on the current state, the action, and standard Gaussian noise. The location of the cliff depends on the environment variable $\theta$, which follows a Beta distribution. If the agent’s current state is lower than the cliff location, it gets a reward equal to its state; otherwise it falls off the cliff and gets a reward of -50000, terminating the episode. Thus the objective is to learn to walk as close to the cliff edge as possible without falling over.

Figure 7 shows that both the UCB and FITBO versions of CPO using the state fingerprint do well on the task. However, the action fingerprint has a much higher variance in performance and, with FITBO, the performance drops after initially finding a good policy. This is because the Gaussian approximation of the action fingerprint in this setting is not ideal due to the application of the sign function in the transition.

Figure 7 shows that not only does CPO-UCB(S) learn a policy with a higher expected return, it also has a much lower variance than all other baselines, except for ALOQ. This is not surprising since, as discussed in Section 2.1, the gradient estimates without active selection of $\psi$ are likely to have high variance due to the presence of SREs. For EPOpt we set and performed rejection sampling after 150 iterations. The poor performance of ALOQ is expected since even in this simple problem, the policy dimensionality is 47, which is still quite high for a BO based method. We could not run OFFER since an analytical solution of does not exist.

### 5.2 Half Cheetah

Next we consider simulated robotic locomotion tasks using the MuJoCo simulator. In the half cheetah task shown in Figure 11, the objective is to learn a locomotion policy for the planar robot with two legs. We modified the original problem such that in 98% of the cases the objective of the agent is to achieve a target velocity of 2, with the rewards decreasing linearly for being away from the target. In the remaining 2%, the target velocity is set to 4, with a large bonus reward, which acts as an SRE.

Figure 11 shows that CPO with the UCB acquisition function outperforms the FITBO acquisition function. We suspect that this is because FITBO tends to over-explore. Also, as mentioned in Section 2.2, it was developed with the aim of minimising simple regret and it is not known how efficient it is at minimising cumulative regret. Unlike in cliff walker, both the action and state fingerprints perform equally well with the UCB acquisition function since the actions are truly continuous.

Figure 11 shows that the Naïve method, OFFER, and random selection of $\psi$ all converge to a locally optimal policy that completely ignores the SRE. We set for EPOpt, but in this case we use the trajectories with returns exceeding the threshold for the policy optimisation since the SRE has a large positive return. Although its performance increases after iteration 4,000, it is extremely sample inefficient, requiring about five times the samples of CPO. We did not run ALOQ as it is entirely infeasible given that the policy dimensionality is more than 20,000.

### 5.3 Ant

The ant environment shown in Figure 15 is a much more difficult problem than the half cheetah since the agent now moves in 3D, and the state space has 111 dimensions compared to 17 for half cheetah. The larger state space also makes learning difficult due to the higher variance in the gradient estimates. We modified the original problem such that velocities greater than 2 carry a 5% chance of damage to the ant. On incurring damage, which we treat as the SRE, the agent receives a large negative reward, and the episode terminates.

In Figure 15 we compare the performance of the UCB versions of CPO. We did not include the FITBO versions since they are not competitive with UCB, as seen in the half cheetah experiment. We see that there is no significant difference between the performance of the state and action fingerprints. Thus the lower dimensional fingerprint (in this case the action fingerprint) can be chosen without any significant drop in performance.

In Figure 15 we see that for the Naïve method and EPOpt the performance drops significantly after about 1000 iterations. This is because the policies being learned by these methods lead to velocities beyond 2, which, after factoring in the effect of the SRE, lead to much lower expected returns. Since the SREs are not seen often enough, these methods do not learn that higher velocities do not lead to higher expected returns. The random baseline tends to perform better since it sees these SREs during learning. However, CPO outperforms it due to the active selection of $\psi$. We could not run OFFER in this setting as computing the IS weights would require knowledge of the transition model.

## 6 Conclusion

In this paper we presented CPO, a method based on the insight that active selection of the environment variable during learning can lead to policies that take into account the effect of SREs. We introduced novel state and action fingerprints that can be used by BO with a one-step improvement objective to make CPO scalable to high dimensional tasks irrespective of the policy dimensionality. We applied CPO to a number of continuous control tasks of varying difficulty and showed that CPO can efficiently learn policies that are robust to significant rare events, which are unlikely to be observable under random sampling but are key to learning good policies. In the future we would like to develop fingerprints for discrete state and action spaces, and explore using a multi-step improvement objective for the BO component.

## Acknowledgements

We would like to thank Binxin Ru for sharing the code for FITBO, and Yarin Gal for the helpful discussions. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement #637713).

## References

- Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gómez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Neural Information Processing Systems (NIPS).
- Auer et al. (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym.
- Chen et al. (2017) Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and de Freitas, N. (2017). Learning to learn without gradient descent by gradient descent. In International Conference on Machine Learning (ICML).
- Ciosek and Whiteson (2017) Ciosek, K. and Whiteson, S. (2017). OFFER: Off-environment reinforcement learning. In AAAI Conference on Artificial Intelligence.
- Cox and John (1992) Cox, D. D. and John, S. (1992). A statistical method for global optimization. In IEEE International Conference on Systems, Man and Cybernetics.
- Cox and John (1997) Cox, D. D. and John, S. (1997). SDO: A statistical method for global optimization. In Multidisciplinary Design Optimization: State-of-the-Art.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML).
- Foerster et al. (2017) Foerster, J., Nardelli, N., Farquhar, G., Torr, P., Kohli, P., and Whiteson, S. (2017). Stabilising experience replay for deep multi-agent reinforcement learning. In International Conference on Machine Learning (ICML).
- Frank et al. (2008) Frank, J., Mannor, S., and Precup, D. (2008). Reinforcement learning in the presence of rare events. In International Conference on Machine Learning (ICML).
- Glynn (1990) Glynn, P. W. (1990). Likelihood ratio gradient estimation for stochastic systems. Commun. ACM.
- Graves et al. (2017) Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks. CoRR, abs/1704.03003.
- Kakade (2001) Kakade, S. (2001). A natural policy gradient. In Neural Information Processing Systems (NIPS).
- Krause and Ong (2011) Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Neural Information Processing Systems (NIPS).
- Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR).
- Malkomes et al. (2016) Malkomes, G., Schaff, C., and Garnett, R. (2016). Bayesian optimization for automated model selection. In Neural Information Processing Systems (NIPS).
- Mordatch et al. (2015) Mordatch, I., Lowrey, K., Andrew, G., Popovic, Z., and Todorov, E. V. (2015). Interactive control of diverse complex characters with neural networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 3132–3140. Curran Associates, Inc.
- Nichol et al. (2018) Nichol, A., Achiam, J., and Schulman, J. (2018). On first-order meta-learning algorithms. CoRR, abs/1803.02999.
- Paul et al. (2018) Paul, S., Chatzilygeroudis, K., Ciosek, K., Mouret, J.-B., Osborne, M., and Whiteson, S. (2018). Alternating optimisation and quadrature for robust control. In AAAI Conference on Artificial Intelligence.
- Peters and Schaal (2006) Peters, J. and Schaal, S. (2006). Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.
- Pinto et al. (2017) Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. (2017). Robust adversarial reinforcement learning.
- Rajeswaran et al. (2017) Rajeswaran, A., Ghotra, S., Levine, S., and Ravindran, B. (2017). EPOpt: Learning robust neural network policies using model ensembles. In International Conference on Learning Representations (ICLR).
- Rasmussen and Williams (2005) Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
- Ru et al. (2017) Ru, B., McLeod, M., Granziol, D., and Osborne, M. (2017). Fast Information-theoretic Bayesian Optimisation. ArXiv e-prints.
- Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML).
- Schulman et al. (2016) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR).
- Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
- Toscano-Palmerin and Frazier (2018) Toscano-Palmerin, S. and Frazier, P. I. (2018). Bayesian Optimization with Expensive Integrands. ArXiv e-prints.
- Wang et al. (2013) Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and de Freitas, N. (2013). Bayesian Optimization in High Dimensions via Random Embeddings. In IJCAI, pages 1778–1784.
- Williams et al. (2000) Williams, B. J., Santner, T. J., and Notz, W. I. (2000). Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica.
- Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.