Using Parameterized Black-Box Priors
to Scale Up Model-Based Policy Search for Robotics
Abstract
The most data-efficient algorithms for reinforcement learning in robotics are model-based policy search algorithms, which alternate between learning a dynamical model of the robot and optimizing a policy to maximize the expected return given the model and its uncertainties. Among the few proposed approaches, the recently introduced Black-DROPS algorithm exploits a black-box optimization algorithm to achieve both high data-efficiency and good computation times when several cores are used; nevertheless, like all model-based policy search approaches, Black-DROPS does not scale to high-dimensional state/action spaces. In this paper, we introduce a new model learning procedure in Black-DROPS that leverages parameterized black-box priors to (1) scale up to high-dimensional systems, and (2) be robust to large inaccuracies of the prior information. We demonstrate the effectiveness of our approach with the “pendubot” swing-up task in simulation and with a physical hexapod robot (48D state space, 18D action space) that has to walk forward as fast as possible. The results show that our new algorithm is more data-efficient than previous model-based policy search algorithms (with and without priors) and that it can allow a physical 6-legged robot to learn new gaits in only 16 to 30 seconds of interaction time.
I Introduction
Robots have to face the real world, in which trying something might take seconds, hours, or even days [1]. Unfortunately, the current state-of-the-art learning algorithms (e.g., deep learning [2]) either rely on the availability of very large data sets (e.g., 1.2 million labeled images in the ImageNet database [3]) or only make sense in simulated environments (e.g., 38 days of learning for Atari games [4]). This scarcity of data calls for algorithms that are highly data-efficient, that is, that minimize the interaction time between the robot and the world, even if it means a considerable computation cost.
In reinforcement learning for robotics, the most data-efficient algorithms are model-based policy search algorithms [5, 6]: after each episode, the algorithm updates a model of the dynamics of the robot, then it searches for the best policy according to the model. To improve the data-efficiency, the current algorithms take the uncertainty of the model into account in order to avoid overfitting the model [7, 8]. The PILCO algorithm [7] implements these ideas, but (1) it imposes several constraints on the reward functions and policies (because it needs to compute gradients analytically), and (2) it is a slow algorithm that cannot benefit from multi-core computers (typically about an hour to complete 15 episodes on the cart-pole benchmark) [8].
The recently introduced Black-DROPS algorithm [8] is one of the first model-based policy search algorithms for robotics that is purely black-box and can extensively take advantage of parallel computations. Black-DROPS achieves similar data-efficiency to state-of-the-art approaches like PILCO (e.g., less than s of interaction time to solve the cart-pole swing-up task), while being faster on multi-core computers, easier to set up, and much less limiting (i.e., it can use any policy and/or reward parameterization; it can even learn the reward model).
However, while Black-DROPS scales well with the number of processors, the main challenge of model-based policy search is scaling up to complex problems: as the algorithm models the transition function between full state/action spaces (joint positions, environment, joint velocities, etc.), the complexity of the model increases substantially with each new degree of freedom; unfortunately, the quantity of data required to learn a good model most often scales exponentially with the dimension of the state space [9]. As a consequence, the data-efficiency of model-based approaches suffers greatly as the dimensionality of the model increases. In practice, model-based policy search algorithms can currently be employed only with simple systems of up to 10-15D state and action space combined (e.g., double cart-pole or a simple manipulator).
One way of tackling the problem raised by the “curse of dimensionality” is to use prior information about the system that is modeled; for instance, dynamic simulators of the robot can be effective priors and are often available. The ideal modelbased policy search algorithm with priors for robotics should, therefore:

scale to high-dimensional and complex robots (e.g., walking or soft robots);

take advantage of multi-core architectures to speed up computation times;

perform the search in the full policy space (i.e., the more real trials, the better expected reward);

make as few assumptions as possible about the type of robot and the prior information (i.e., require no specific structure or differentiable models);

be able to select among several prior models or to tune the prior model.
A few algorithms leverage prior information to speed up learning on the real system [10, 11, 12, 13, 14, 15], but none of them fulfills all of the above properties. In this paper, we propose a novel, purely black-box, flexible and data-efficient model-based policy search algorithm that combines ideas from the Black-DROPS algorithm, from simulation-based priors, and from recent model learning algorithms [16, 17]. We show that our approach is capable of learning policies in about 30 seconds to control a damaged physical hexapod robot (48D state space, 18D action space) and outperforms state-of-the-art model-based policy search algorithms without (PILCO [7], Black-DROPS [8]) and with priors (PILCO with priors [10]), as well as prior-based Bayesian optimization (IT&E [14]).
II Background
II-A Policy Search for Robotics
Model-free policy search (PS) methods have been successful in robotics as they can easily be applied in high-dimensional continuous state-action RL problems [5, 18, 19]. The PoWER algorithm [20] uses probability-weighted averaging, which has the property of following the natural gradient without computing it. The PI² [21] algorithm performs very similarly to PoWER, but puts no constraint on the reward function. The Natural Evolution Strategies (NES) [22] and Covariance Matrix Adaptation ES (CMA-ES) [23] families of algorithms are population-based black-box optimizers that iteratively update a search distribution by calculating an estimated gradient on the distribution parameters (mean and covariance). At each generation, they sample a set of policy parameters and rank them based on their expected return. NES performs gradient ascent along the natural gradient, whereas CMA-ES updates the distribution by exploiting the technique of evolution paths.
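The sample, rank and update cycle shared by these population-based optimizers can be illustrated with a deliberately simplified evolution strategy. The sketch below is neither NES nor CMA-ES: it keeps an isotropic search distribution, moves the mean to the average of the best samples, and shrinks the step size on a fixed schedule. The function name `simple_es` and all of its parameters are assumptions made for this illustration.

```python
import numpy as np


def simple_es(objective, mean, sigma=0.5, pop_size=20, n_elite=5,
              generations=50, seed=0):
    """Toy evolution strategy: sample, rank by return, move the mean.

    Illustrative stand-in only; it does not follow the natural gradient
    (NES) and does not adapt a covariance matrix (CMA-ES).
    """
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    for _ in range(generations):
        # Sample a population of candidate policy parameters.
        population = mean + sigma * rng.standard_normal((pop_size, mean.size))
        # Evaluate and rank the candidates by their return.
        returns = np.array([objective(p) for p in population])
        elite = population[np.argsort(returns)[-n_elite:]]
        # Update the search distribution: new mean, smaller step size.
        mean = elite.mean(axis=0)
        sigma *= 0.95
    return mean
```

For example, `simple_es(lambda p: -np.sum((p - np.array([1.0, -2.0])) ** 2), [0.0, 0.0])` moves the mean towards `[1, -2]`.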
Although model-free policy search methods are promising, they require a few hundred or a few thousand episodes to converge to good solutions [5, 6]. The data-efficiency of such methods can be increased by learning the model (i.e., transition and reward function) of the system from data and inferring the optimal policy from the model [5, 6]. For example, state-of-the-art model-free policy gradient methods (e.g., TRPO [19] or DDPG [18]) require more than of interaction time to solve the cart-pole swing-up task [18] whereas state-of-the-art model-based policy search algorithms (e.g., PILCO or Black-DROPS) require less than [8, 7]. Probabilistic models have been more successful than deterministic ones, as they provide an estimate of the uncertainty of their approximation, which can be incorporated into long-term planning [7, 8, 6, 5].
Black-DROPS [8] and PILCO [7] are two of the most data-efficient model-based policy search algorithms for robot control. They essentially differ in how they use the uncertainty of the model and in how they optimize the policy given the model: PILCO uses moment matching and analytical gradients [7], whereas Black-DROPS uses Monte-Carlo rollouts and a black-box optimizer.
Black-DROPS adds two main benefits to PILCO: (1) any reward function or policy parameterization can be used (including non-differentiable policies like finite automata), and (2) it is a highly-parallel algorithm that takes advantage of multi-core computers. Black-DROPS achieves similar data-efficiency to PILCO and escapes local optima faster in standard control benchmarks (inverted pendulum and cart-pole swing-up) [8]. It was also able to learn from scratch a high-dimensional policy (neural network with 134 parameters) in only 56 trials on a physical low-cost manipulator [8].
II-B Accelerating Policy Search using Priors
Model-based policy search algorithms reduce the required interaction time, but for more complex or higher-dimensional systems, they still require dozens or even hundreds of episodes to find a working policy; in some systems, they might also fail to find any good policy because of the inevitable model errors and biases [24].
One way to reduce the interaction time without learning models is to begin with a meaningful initial policy (coming from demonstration or simulation) and then search locally to improve it. Usually this is done by human demonstration and movement primitives [25]: a human either teleoperates the robot or moves it by hand to achieve the task, and then a model-free RL method is applied to improve the initial policy [20, 26]. However, these approaches still suffer from the data inefficiency of model-free approaches and require dozens or hundreds of episodes to find good policies.
Another way to reduce the interaction time in model-free approaches is to pre-compute archives/libraries of policies/controllers [27, 28] and then search online for the one that works best on the real system [14, 29]. The Intelligent Trial-and-Error (IT&E) algorithm [14] first uses an evolutionary algorithm called MAP-Elites [30, 31] offline to create an archive of diverse and locally high-performing behaviors and then utilizes a modified version of Bayesian optimization (BO) [32] to quickly find a compensatory behavior. Although IT&E can allow, for instance, a damaged 6-legged robot to find a new gait in about a dozen trials (less than 2 minutes) and a robotic arm to overcome several blocked joints in a few minutes, it does not search in the full policy space and as such there is no guarantee that the optimal policy can be found.
Reducing the interaction time in model-based policy search can be achieved by using priors on the models [10, 11, 12, 13, 33]; i.e., starting with an initial guess of the dynamics and then learning the residual model. PILCO with priors [10] and PI-REM [12] are closely related as they both use the policy search procedure of PILCO. PILCO with priors uses simulated data to create a Gaussian process prior, whereas PI-REM uses analytic equations for the prior model. The main limitation of PILCO with priors is that it implicitly requires the task to be solved in the prior model with PILCO (in order to get the speed-up shown in the original paper [10]). GP-ILQG [11] also learns the residual model like PI-REM and then uses a modified version of ILQG [34] to find a policy given the uncertainties of the model. GP-ILQG, however, requires the prior model to be differentiable.
II-C Model Identification and Learning
The traditional way of exploiting analytic equations is model identification [35]. Most approaches for model identification rely on two main ingredients: (a) proper excitation of the system [36, 35, 37] and (b) parametric models. Recently, Xie et al. [38] proposed a method that combines model identification and RL. More specifically, their approach relies on a Model Predictive Control (MPC) scheme with optimistic exploration on a parametric model that is estimated from the collected data using least-squares.
However, these approaches assume that the analytical equations can fully capture the system, which is often not the case when dealing with unforeseen effects like, for example, complex friction effects or when there exists severe model mismatch (i.e., no parameters can explain the data) like, for instance, when the robot is damaged.
A few methods have been proposed to combine model identification and model learning [16, 17]. Nevertheless, these methods are based on the manipulator equation, which they exploit in different ways, and it is not straightforward to use them with more complicated robots that involve complex collisions and contacts (e.g., walking or complex soft robots).
III Problem Formulation
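We consider dynamical systems of the form:
(1) $\mathbf{x}_{t+1} = \mathbf{x}_t + M_r(\mathbf{x}_t, \mathbf{u}_t) + \mathbf{w}$
with continuous-valued states $\mathbf{x} \in \mathbb{R}^E$ and controls $\mathbf{u} \in \mathbb{R}^F$, i.i.d. Gaussian system noise $\mathbf{w}$, and unknown transition dynamics $M_r$. We assume that we have an initial guess of the dynamics, the function $M(\mathbf{x}_t, \mathbf{u}_t)$, that may not be accurate either because we do not have a very precise model of our system (i.e., what is called the “reality-gap” [39]) or because the robot is damaged in an unforeseen way (e.g., a blocked joint or faulty motor/encoder) [14, 40].
Contrary to previous works [11, 16, 17], we assume no structure or specific properties of our initial dynamics model (i.e., we treat it as a black-box function), other than it has some tunable parameters, $\boldsymbol{\phi}_M$, which change its behavior. Examples of these parameters can be some optimization parameters (e.g., type of optimizer) of a dynamic simulator involving contacts and collisions or some internal parameters of the robot (e.g., masses of the bodies). Finally, we add a non-parametric model, $f(\mathbf{x}_t, \mathbf{u}_t, \boldsymbol{\phi}_K)$ (with associated hyperparameters $\boldsymbol{\phi}_K$), to model whatever is not possible to capture with $M$:
(2) $\mathbf{x}_{t+1} = \mathbf{x}_t + M(\mathbf{x}_t, \mathbf{u}_t, \boldsymbol{\phi}_M) + f(\mathbf{x}_t, \mathbf{u}_t, \boldsymbol{\phi}_K) + \mathbf{w}$
Our objective is to find a deterministic policy $\pi$, $\mathbf{u} = \pi(\mathbf{x} \mid \boldsymbol{\theta})$, that maximizes the expected long-term reward when following policy $\pi$ for $T$ time steps:
(3) $J(\boldsymbol{\theta}) = \mathbb{E}\left[\sum_{t=1}^{T} r(\mathbf{x}_t) \,\middle|\, \boldsymbol{\theta}\right]$
where $r(\mathbf{x}_t)$ is the immediate reward of being in state $\mathbf{x}_t$. We assume that $\pi$ is a function parameterized by $\boldsymbol{\theta} \in \Theta$.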
In model-based policy search with priors, we begin by optimizing the policy on the prior model (that is, there is no prior information on the policy parameters) and applying it on the real system to gather the initial data. Afterwards, a loop is iterated where we first learn a model using the prior model and the collected data and then optimize the policy given this newly learned model (Algo. 1). Finally, the policy is applied on the real system, more data is collected and the loop reiterates until the task is solved.
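The loop described above can be sketched in a few lines. This is only an outline under stated assumptions, not the authors' implementation: `learn_model`, `optimize_policy`, `run_episode` and `task_solved` are hypothetical callables supplied by the user, and the stopping rule is reduced to a single predicate on the episode return.

```python
def policy_search_with_priors(prior_model, learn_model, optimize_policy,
                              run_episode, task_solved, max_episodes=10):
    """Outline of model-based policy search with a prior model.

    All callables are hypothetical placeholders:
      learn_model(prior_model, data) -> model
      optimize_policy(model)         -> policy
      run_episode(policy)            -> (list of transitions, episode return)
      task_solved(episode_return)    -> bool
    """
    data = []
    # No prior on the policy: the first policy is optimized on the prior model.
    policy = optimize_policy(prior_model)
    for _ in range(max_episodes):
        # Apply the current policy on the real system and collect data.
        transitions, episode_return = run_episode(policy)
        data.extend(transitions)
        if task_solved(episode_return):
            break
        # Learn a model from the prior and all data, then re-optimize.
        model = learn_model(prior_model, data)
        policy = optimize_policy(model)
    return policy, data
```

Keeping the model learner and the policy optimizer as plain callables mirrors the black-box spirit of the approach: either piece can be swapped without touching the loop.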
IV Approach
IV-A Gaussian processes with the simulator as the mean function
We would like to have a model that approximates as accurately as possible the unknown dynamics of our system given some initial guess, $M$. We rely on Gaussian processes (GPs) to do so as they have been successfully used in many model-based reinforcement learning approaches [7, 8, 41, 42, 5, 40, 6]. A GP is an extension of the multivariate Gaussian distribution to an infinite-dimension stochastic process for which any finite combination of dimensions will be a Gaussian distribution [43].
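As inputs, we use tuples made of the state vector $\mathbf{x}_t$ and the action vector $\mathbf{u}_t$, that is, $\tilde{\mathbf{x}}_t = (\mathbf{x}_t, \mathbf{u}_t) \in \mathbb{R}^{E+F}$; as training targets, we use the difference between the current state vector and the next one: $\Delta_{\mathbf{x}_t} = \mathbf{x}_{t+1} - \mathbf{x}_t \in \mathbb{R}^E$. We use $E$ independent GPs to model each dimension of the difference vector $\Delta_{\mathbf{x}_t}$. Assuming $D_{1:t} = \{M_r(\tilde{\mathbf{x}}_1), \dots, M_r(\tilde{\mathbf{x}}_t)\}$ is a set of observations and $M(\tilde{\mathbf{x}})$ being the simulator function (i.e., our initial guess of the dynamics, tunable or not; we drop the parameters here for brevity), we can query the GP at a new input point $\tilde{\mathbf{x}}_*$:
(4) $p(\hat{M}_r(\tilde{\mathbf{x}}_*) \mid D_{1:t}, \tilde{\mathbf{x}}_*) = \mathcal{N}(\mu(\tilde{\mathbf{x}}_*), \sigma^2(\tilde{\mathbf{x}}_*))$
The mean and variance predictions of this GP are computed using a kernel vector $\mathbf{k} = k(D_{1:t}, \tilde{\mathbf{x}}_*)$, and a kernel matrix $K$, with entries $K^{ij} = k(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)$:
(5) $\mu(\tilde{\mathbf{x}}_*) = M(\tilde{\mathbf{x}}_*) + \mathbf{k}^T K^{-1} (D_{1:t} - M(\tilde{\mathbf{x}}_{1:t}))$, $\quad \sigma^2(\tilde{\mathbf{x}}_*) = k(\tilde{\mathbf{x}}_*, \tilde{\mathbf{x}}_*) - \mathbf{k}^T K^{-1} \mathbf{k}$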
The formulation above allows us to combine observations from the simulator and the real world smoothly. In areas where real-world data is available, the simulator’s prediction will be corrected to match the real-world observations. On the contrary, in areas far from real-world data, the predictions fall back to the simulator [14, 11, 40].
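The behavior described above (corrected near real data, falling back to the simulator far from it) can be reproduced with a small NumPy sketch of GP regression with a non-zero mean function. This is not the Limbo-based implementation used in the paper: the squared-exponential ARD kernel, the noise term `sigma_n` and the function names are assumptions made for illustration, and `prior_mean` stands in for the simulator.

```python
import numpy as np


def se_ard_kernel(A, B, lengthscales, sigma_f):
    """Squared-exponential kernel with one lengthscale per input dimension."""
    A = np.atleast_2d(A) / lengthscales
    B = np.atleast_2d(B) / lengthscales
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * np.maximum(sq, 0.0))


def gp_predict(X, y, X_star, prior_mean, lengthscales, sigma_f, sigma_n):
    """Predictive mean and variance of a GP whose mean function is prior_mean.

    X: (n, d) training inputs, y: (n,) targets for ONE output dimension,
    X_star: (m, d) query inputs, prior_mean: callable mapping (k, d) -> (k,).
    """
    K = se_ard_kernel(X, X, lengthscales, sigma_f) + sigma_n**2 * np.eye(len(X))
    k_star = se_ard_kernel(X, X_star, lengthscales, sigma_f)  # shape (n, m)
    # The GP only has to explain what the prior gets wrong.
    residual = y - prior_mean(X)
    mu = prior_mean(X_star) + k_star.T @ np.linalg.solve(K, residual)
    var = sigma_f**2 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mu, var
```

With a prior `lambda Z: 2.0 * Z[:, 0]` and data that are offset from it by 0.5, the prediction near the data includes the offset, while a query far away returns the prior value together with the full prior variance `sigma_f**2`.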
This model learning procedure has been used in several articles [33, 16, 42] and in particular to learn the cumulative reward model for a BO procedure highlighted in the IT&E approach [14]. GP-ILQG [11] and PI-REM [12] formulate a similar model learning procedure for optimal control (under model uncertainty) and policy search respectively. GP-ILQG additionally assumes that the prior model is differentiable, which is not always true and might be too slow to perform via finite differences (e.g., when using black-box simulators for $M$). PILCO with priors [10] utilizes a similar scheme but assumes that the prior model is a GP learned from simulation data that is gathered from running PILCO on the prior system.
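We use the exponential kernel with automatic relevance determination [43] ($\boldsymbol{\phi}_K$ are the kernel hyperparameters). When searching for the best kernel hyperparameters through Maximum Likelihood Estimation (MLE) for a GP with a non-tunable mean function $M$, we seek to maximize [43]:
(6) $p(D_{1:t} \mid \tilde{\mathbf{x}}_{1:t}, \boldsymbol{\phi}_K) = \frac{1}{\sqrt{(2\pi)^t |K|}} \exp\left(-\frac{1}{2} (D_{1:t} - M(\tilde{\mathbf{x}}_{1:t}))^T K^{-1} (D_{1:t} - M(\tilde{\mathbf{x}}_{1:t}))\right)$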
The gradients of this likelihood function can be analytically computed, which makes it possible to use any gradient-based optimizer (we use Rprop [44]). Since we have $E$ independent GPs, we have $E$ independent optimizations. We use the limbo C++11 library for GP regression [45].
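For numerical reasons, GP implementations commonly work with the logarithm of the marginal likelihood rather than with the likelihood itself. As a reminder of the standard form (the notation is introduced here only for illustration: $\mathbf{r}$ is the vector of residuals between the observed targets and the mean-function predictions, $K$ the kernel matrix, and $t$ the number of observations):

```latex
\log p = -\frac{1}{2}\,\mathbf{r}^{T} K^{-1} \mathbf{r}
         \;-\; \frac{1}{2}\,\log |K|
         \;-\; \frac{t}{2}\,\log 2\pi
```

The first term rewards hyperparameters that explain the residuals, the second penalizes overly flexible kernels, and the third is a constant.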
IV-B Mean functions with tunable parameters
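We would like to use a mean function $M(\tilde{\mathbf{x}}, \boldsymbol{\phi}_M)$, where each vector $\boldsymbol{\phi}_M$ corresponds to a different prior model of our system (e.g., different lengths of links). Searching for the $\boldsymbol{\phi}_M$ that best matches the observations can be seen as a model identification procedure, which could be solved via minimizing the mean squared error; nevertheless, the GP framework allows us to jointly optimize for the kernel hyperparameters and the mean parameters, which allows the modeling procedure to balance between non-parametric and parametric modeling. We can easily extend Eq. (6) to include parameterized mean functions:
(7) $p(D_{1:t} \mid \tilde{\mathbf{x}}_{1:t}, \boldsymbol{\phi}_K, \boldsymbol{\phi}_M) = \frac{1}{\sqrt{(2\pi)^t |K|}} \exp\left(-\frac{1}{2} (D_{1:t} - M(\tilde{\mathbf{x}}_{1:t}, \boldsymbol{\phi}_M))^T K^{-1} (D_{1:t} - M(\tilde{\mathbf{x}}_{1:t}, \boldsymbol{\phi}_M))\right)$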
This time, even though we have $E$ independent GPs (one for each output dimension), all of them need to share the same mean parameters (contrary to the kernel parameters, which are typically different for each dimension), because the model of the robot should be consistent in all of the output dimensions. Thus, we have to jointly optimize for the mean parameters and the kernel hyperparameters of all the GPs. Since most dynamic simulators are not differentiable (or too slow to differentiate by finite differences), we cannot resort to gradient-based optimization to optimize Eq. (7) jointly for all the GPs. A black-box optimizer like CMA-ES [23] could be employed instead, but this optimization was too slow to converge in our preliminary experiments.
To combine the benefits of both gradient-based and gradient-free optimization, we use gradient-based optimization for the kernel hyperparameters (since we know the analytical gradients) and black-box optimization for the mean parameters. Conceptually, we would like to optimize for the mean parameters, $\boldsymbol{\phi}_M$, given the optimal kernel hyperparameters for each of them. Since we do not know them beforehand, we use two nested optimization loops: (a) an outer loop where a gradient-free local optimizer searches for the best parameters $\boldsymbol{\phi}_M$ (we use a variant of the Subplex algorithm [46] provided by NLOpt [47] for continuous spaces and exhaustive search for discrete ones), and (b) an inner optimization loop where, given a mean parameter vector $\boldsymbol{\phi}_M$, a gradient-based optimizer searches for the best kernel hyperparameters (each GP is independently optimized since $\boldsymbol{\phi}_M$ is fixed in the inner loop) and returns a score that corresponds to $p(D_{1:t} \mid \tilde{\mathbf{x}}_{1:t}, \boldsymbol{\phi}_K^*, \boldsymbol{\phi}_M)$ for the optimal $\boldsymbol{\phi}_K^*$ (Algo. 2).
One natural way of combining the likelihoods of the independent GPs to form the objective function of the outer loop is to take the product, which would be equivalent to taking the joint probability of the likelihoods of the independent GPs (since the likelihood is a probability density function). However, we observed that taking the sum or the harmonic mean of the likelihoods instead yielded more robust results. This comes from the fact that the product can be dominated by only a few terms: if some parameters explain one output dimension perfectly and all the others not as well, they would still be chosen. In addition, we observed in practice that taking the sum of the likelihoods was numerically more stable than the harmonic mean.
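The two nested loops and the summed-likelihood score can be sketched as follows for a discrete set of candidate prior parameters. Several simplifications are assumptions of this sketch rather than the method of the paper: the inner loop is a coarse grid search over a single shared lengthscale (standing in for the gradient-based Rprop optimization), the kernel is a squared exponential with fixed noise, and the names `log_likelihood`, `inner_loop` and `identify_prior` are hypothetical. Continuous parameters, handled in the paper by a Subplex variant, are not covered.

```python
import numpy as np


def log_likelihood(X, y, prior_pred, lengthscale, sigma_n=0.1):
    """Log marginal likelihood of one GP whose mean is the prior prediction."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-0.5 * sq_dist / lengthscale**2) + sigma_n**2 * np.eye(len(X))
    r = y - prior_pred  # what the prior fails to explain
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (r @ np.linalg.solve(K, r) + logdet + len(X) * np.log(2.0 * np.pi))


def inner_loop(X, y, prior_pred, lengthscales=(0.3, 1.0, 3.0)):
    """Best log-likelihood over kernel hyperparameters (grid search stand-in)."""
    return max(log_likelihood(X, y, prior_pred, ls) for ls in lengthscales)


def identify_prior(X, Y, prior, candidates):
    """Outer loop: exhaustive search over discrete prior parameters.

    prior(X, phi) must return an (n, E) array of predicted targets.
    The score is the SUM of the per-dimension likelihoods.
    """
    best_phi, best_score = None, -np.inf
    for phi in candidates:
        pred = prior(X, phi)
        score = sum(np.exp(inner_loop(X, Y[:, d], pred[:, d]))
                    for d in range(Y.shape[1]))
        if score > best_score:
            best_phi, best_score = phi, score
    return best_phi, best_score
```

Note that all output dimensions share the same `phi`, while each dimension picks its own kernel hyperparameter in the inner loop.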
Our model learning approach, which we call GP-MI (Gaussian Process Model Identification), combines non-parametric model learning and parametric model identification. It is related to the approach in [16], but there are some key differences between them. Firstly, the model learning procedure in [16] depends on the manipulator equation and cannot easily be used with robots that do not directly comply with this equation (one example would be the hexapod robot in our experiments or a soft robot with complex dynamics), whereas GP-MI imposes no structure on the prior model, other than providing some tunable parameters (continuous or discrete). Furthermore, the approach in [16] is tied to inverse dynamics models and cannot be used with forward models in the general case (necessary for long-term forward predictions); on the contrary, GP-MI can be used with inverse or forward dynamics models and in general with any black-box tunable prior model.
IV-C Policy Search with the Black-DROPS algorithm
We use the Black-DROPS [8] algorithm for policy search because it allows us to use the type of priors discussed in Section IV-B and to leverage specific policy parameterizations that are suitable for different cases (e.g., we use a neural network policy for the pendubot task and an open-loop periodic policy for the hexapod). We assume no prior information on the policy parameters and we begin by optimizing the policy on the prior model. Moreover, we took advantage of multi-core architectures to speed up our experiments. Contrary to Black-DROPS, PILCO [7] cannot take advantage of multiple cores¹, and the need for deriving all the gradients for a different policy/reward makes it difficult (or even impossible) to try new ideas/policies.
¹ For reference, each run of PILCO with priors (26 episodes + model learning) in the pendubot task took around 70 hours on a modern computer with 16 cores, whereas each run of Black-DROPS with priors and Black-DROPS with GP-MI took around 15 hours and 24 hours respectively.
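To take the uncertainties of the model into account, the core idea of Black-DROPS is to avoid computing the expected reward of policy parameters, which is what most approaches do and is usually either computationally expensive [48] or requires some approximation to be made [7]. Instead, it treats each Monte-Carlo rollout as a noisy measurement of a function $G(\boldsymbol{\theta})$ that is the actual function $J(\boldsymbol{\theta})$ perturbed by a noise $N(\boldsymbol{\theta})$ and tries to maximize its expectation:
(8) $\mathbb{E}[G(\boldsymbol{\theta})] = \mathbb{E}[J(\boldsymbol{\theta}) + N(\boldsymbol{\theta})] = J(\boldsymbol{\theta}) + \mathbb{E}[N(\boldsymbol{\theta})]$
We assume that $\mathbb{E}[N(\boldsymbol{\theta})] = 0$ for all $\boldsymbol{\theta} \in \Theta$ and therefore maximizing $\mathbb{E}[G(\boldsymbol{\theta})]$ is equivalent to maximizing $J(\boldsymbol{\theta})$ (see Eq. (3)). The second main idea of Black-DROPS is to use a population-based black-box optimizer that (1) can optimize noisy functions and (2) can take advantage of multi-core computers. Here we use BIPOP-CMA-ES [23, 8].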
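A single noisy evaluation of this kind can be sketched as one Monte-Carlo rollout through a probabilistic model: at every step the next state is sampled from the model's predictive distribution instead of being propagated in closed form. The function below is only an illustration under assumptions: `predict` is a hypothetical callable returning the predictive mean and variance of the state difference (for instance, from per-dimension GPs), and the signatures of `policy` and `reward` are chosen for the example.

```python
import numpy as np


def noisy_return(theta, policy, predict, reward, x0, horizon, rng):
    """One Monte-Carlo rollout: a single noisy measurement of the return.

    policy(x, theta) -> action
    predict(x, u)    -> (mean, variance) of the state difference
    reward(x)        -> immediate reward of being in state x
    """
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for _ in range(horizon):
        u = policy(x, theta)
        mu, var = predict(x, u)
        # Sample the state difference rather than using its mean.
        x = x + rng.normal(mu, np.sqrt(var))
        total += reward(x)
    return total
```

A population-based optimizer that tolerates noisy objectives can then call `noisy_return` once per candidate `theta`, and the calls for different candidates can run in parallel.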
PI-REM [12] is close to our approach as it leverages priors to learn the residual model and then performs policy search on the model². However, PI-REM assumes that the prior information is fixed and cannot be tuned, whereas our approach has the additional flexibility of being able to change the behavior of the prior. In addition, PI-REM utilizes the policy search procedure of PILCO that can be limiting in many cases as already discussed. Nevertheless, as Black-DROPS and PILCO have been shown to perform similarly when PILCO’s limitations are not present [8], we include in our experiments a variant of our approach that resembles PI-REM (Black-DROPS with priors).
² At the time of this submission, only a preprint of [12] was available; we had no access to the final, peer-reviewed version as it has not been published yet.
V Experimental Results
V-A Pendubot swing-up task
We first evaluate our approach in simulation with the pendubot swing-up task. The pendubot is a two-link underactuated robotic arm (with lengths , and masses , ) and was introduced by [49] (Fig. 2). The inner joint (attached to the ground) exerts a torque , but the outer joint cannot (both of the joints are subject to some friction with coefficients , ). The system has four continuous state variables: two joint angles and two joint angular velocities. The angles of the joints, and , are measured anticlockwise from the upright position. The pendubot starts hanging down and the goal is to find a policy such that the pendubot swings up and then balances in the upright position. Each episode lasts s and the control rate is Hz. We use a distance-based reward function as in [8].
We chose this task because it is a fairly difficult problem that forces slower convergence on model-based techniques without priors, but it is not too hard (i.e., it can be solved without priors in reasonable interaction time); this allowed us to make a rather extensive evaluation with meaningful comparisons (4 different prior models, 7 different algorithms, 30 replicates of each combination). We assume that we have 4 priors available; we tried to capture easy and difficult cases and cases where all the wrong parameters can be tuned or not (see Table I). We compare 7 algorithms:

Black-DROPS [8]

Black-DROPS with priors, which is essentially equivalent to PI-REM [12] and close to GP-ILQG [11]³
³ The algorithm in this specific form is first formulated in this paper (i.e., the Black-DROPS policy search procedure with a prior model), but, as discussed above, it is close in spirit to GP-ILQG [11] and PI-REM [12]. Therefore, we assume that the performance of Black-DROPS with priors is representative of what could be achieved with PI-REM and GP-ILQG, although Black-DROPS with priors should be more effective because it performs a more global search [8].

Black-DROPS with GP-MI; our approach

Black-DROPS with MI; Black-DROPS where model learning is replaced by model identification (via mean squared error)

PILCO [7]

PILCO with priors [10]

IT&E [14]

Table I: Actual system and priors for the pendubot task (surviving entries: 0.5 and 0.1; row and column labels not recoverable).

For Black-DROPS with GP-MI and the MI variant, we additionally assume that the parameters , , and can be tuned, but the parameters and are fixed (to the prior) and cannot be changed. Since the adaptation part of IT&E is a deterministic algorithm (given the same prior) and our system has no uncertainty, for each prior we generated 30 archives with different random seeds and then ran the adaptation part of IT&E once for each archive. We used 3 end-effector positions, equally spread in time, as the behavior descriptor for the archive generation with MAP-Elites. For all the Black-DROPS variants and for IT&E we used a neural network policy with one hidden layer (10 hidden neurons) and the hyperbolic tangent as the activation function.
Similarly to IT&E, since PILCO with priors is a deterministic algorithm given the same prior, for each prior we ran PILCO 30 times with different random seeds on the prior model (for 40 episodes in order for PILCO to converge to a good policy and model) and then ran PILCO with priors on the actual system once for each different model. We used priors both in the policy and the dynamics model when learning in the actual system (as advised in [10]). We also used a GP policy with 200 pseudo-observations [7]⁴.
⁴ These are the parameters that come with the original code of PILCO. We used the code from: https://bitbucket.org/markjcutler/gaussianprocess.
Black-DROPS with GP-MI always solves the task and achieves high rewards at least as fast as all the other approaches in the cases that we considered (Fig. 3). Black-DROPS with MI performs very well when the parameters it can tune are the ones that are wrong (Fig. 3A,B,C), and badly otherwise (Fig. 3D, i.e., when no parameters of the prior model can explain the data). Black-DROPS with priors performs very well whenever the prior model is not far away from the real one (Fig. 3A,B) and not so well whenever the prior is misleading (Fig. 3C). Neither Black-DROPS nor PILCO can solve the task in less than of interaction time, but Black-DROPS shows a faster learning curve (Fig. 3).
Interestingly, PILCO with priors is not always able to achieve better results than Black-DROPS and is always worse than Black-DROPS with priors. This can be explained by the fact that PILCO without priors learns more slowly than Black-DROPS and is a more local search algorithm and as such needs more interaction time to achieve good results. On the contrary, Black-DROPS uses a modified version of CMA-ES that can more easily escape local optima [8]. Moreover, the initial prior model for PILCO with priors is an approximated model, whereas Black-DROPS with priors uses the actual prior model to begin with. Lastly, the GP policy that PILCO is mainly used with⁵ creates very high-dimensional policy spaces compared to the simple neural network policy that Black-DROPS is using (i.e., 1400 vs 81 parameters) and as such causes the policy search to converge more slowly.
⁵ So far, PILCO can only be used with linear or GP policy types [7].
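The two parameter counts quoted above can be checked with a little arithmetic. Both match a 6-dimensional policy input and a single torque output; the 6-D input (for example, two angular velocities plus the sine and cosine of each joint angle) is an assumption made here, since the input encoding is not stated in this section. The GP policy count also assumes that only pseudo-inputs and pseudo-targets are counted. The helper names are hypothetical.

```python
def mlp_param_count(n_in, n_hidden, n_out):
    """Weights and biases of a network with one hidden layer."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def gp_policy_param_count(n_pseudo, n_in, n_out):
    """Pseudo-inputs plus pseudo-targets (kernel hyperparameters excluded)."""
    return n_pseudo * (n_in + n_out)


# Assumed 6-D policy input and 1-D action (torque on the inner joint).
nn_params = mlp_param_count(n_in=6, n_hidden=10, n_out=1)
gp_params = gp_policy_param_count(n_pseudo=200, n_in=6, n_out=1)
```

Under these assumptions `nn_params` is 81 and `gp_params` is 1400, which agrees with the figures in the text.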
IT&E is not able to reliably solve the task and achieve high rewards. This is because IT&E assumes that (a) the system is redundant enough so that the task can be solved in many different ways and (b) there is a policy/controller in the pre-computed archive that can solve the task (i.e., IT&E cannot search outside of this archive) [14]. Obviously, these assumptions are violated in the pendubot scenario: (a) the system is underactuated and thus does not have the required redundancy, and (b) the system is inherently unstable and as such precise policy parameters are needed (it is highly unlikely that one of them exists in the pre-computed archive).
V-B Physical hexapod locomotion
We also evaluate our approach on the hexapod locomotion task as introduced in the IT&E paper [14] with a physical robot (Fig. 1A). This scenario is where IT&E excels and achieves remarkable recovery capabilities [14]. We assume that a simulator of the intact robot is available (Fig. 1B); for GP-MI we also assume that we can alter this simulator by removing 1 leg of the hexapod (i.e., there are 7 different discrete parameterizations). This simulator is not accurate as we assume perfect velocity actuators and infinite torque. Each leg has 3 DOF leading to a total of 18 DOF. The state of the robot consists of 18 joint angles, 18 joint velocities, a 6D Center Of Mass (COM) pose (position and orientation) and 6D COM velocities. The policy is an open-loop controller with 36 parameters that outputs 18D joint angles every s and is similar to the one used in [14]. Each episode lasts s and the robot is tracked with a motion capture system.
The task is to find a policy to walk forward as fast as possible. Due to the complexity of the problem⁶, we only compare 2 algorithms (IT&E and our approach) on 2 different conditions: (a) crossing the reality-gap problem; in this case our approach cannot rely mostly on the identification part and the importance of the GP modeling will be highlighted, and (b) one rear leg is removed; the back leg removals are especially difficult as most effective gaits of the intact robot rely on them.
⁶ PILCO and Black-DROPS could not find any solution in preliminary simulation experiments even after several minutes of interaction time and Black-DROPS with priors was worse than Black-DROPS with GP-MI.
The results show that Black-DROPS with GP-MI is able to learn highly effective walking policies on the physical hexapod robot (Fig. 4). In particular, using the dynamics simulator as prior information, Black-DROPS with GP-MI is able to achieve better walking speeds (and with less variance) than IT&E [14] on the intact physical hexapod (Fig. 4A). Moreover, in the rear-leg removal damage case, Black-DROPS with GP-MI allows the damaged robot to walk effectively after only 16 to 30 seconds of interaction time and finds higher-performing policies than IT&E ( vs in the episode) (Fig. 4B).
Overall, Black-DROPS with GP-MI was able to successfully learn working policies even though the dimensionality of the state and the action space of the hexapod robot is 48D and 18D respectively. In addition, in the rear leg damage case, Black-DROPS always tried safer policies than IT&E, which too often executed policies that would cause the robot to fall over. A video of our algorithm running on the damaged hexapod is available in the supplementary video (also at https://youtu.be/HFkZkhGGzTo).
VI Conclusion and Discussion
BlackDROPS with GPMI is one of the first model-based policy search algorithms that can efficiently learn with high-dimensional physical robots. It was able to learn walking policies for a physical hexapod (48D state and 18D action space) in less than 1 minute of interaction time, without any prior on the policy parameters (that is, it learns a policy from scratch). The black-box nature of our approach, along with the extra flexibility of tuning the black-box prior model, opens a new direction of experimentation, as changing priors, robots, or tasks requires minimal effort.
The way we compute the long-term predictions (i.e., by chaining model predictions) requires that predicted states (the output of the GPs) are fed back to the prior simulator. This can cause the simulator to crash because there is no guarantee that a predicted state, which may make sense in the real world, will make sense in the prior model, especially when the two models (prior and real) differ a lot and when obstacles and collisions are involved. This also holds for most other prior-based methods [11, 12, 10], but it is not easily seen in simple systems. In contrast, we observed this phenomenon a few times in our hexapod experiments. Using the prior simulator only as a reference, without mixing prior and real data, is a direction for future work.
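The chaining issue discussed above can be illustrated with a small Python sketch. It rolls out a policy by repeatedly feeding the predicted state back into a prior model and adding a learned correction, and it stops as soon as the predicted state is one the prior model cannot accept. Everything here is a simplifying assumption: the scalar state, the toy functions `prior_step`, `residual_mean`, and `is_valid`, and the use of the correction's mean only (no predictive uncertainty and no sampling).

```python
def rollout(x0, policy, prior_step, residual_mean, is_valid, horizon):
    """Chain one-step predictions x_{t+1} = prior_step(x_t, u_t) + residual_mean(x_t, u_t).

    The rollout stops early if a predicted state is rejected by is_valid,
    mimicking a prior simulator that cannot be reset to that state.
    """
    states = [x0]
    x = x0
    for _ in range(horizon):
        u = policy(x)
        x_next = prior_step(x, u) + residual_mean(x, u)
        if not is_valid(x_next):
            break  # the prior model cannot continue from this state
        states.append(x_next)
        x = x_next
    return states


# Hypothetical toy system (all of these are illustrative assumptions).
def prior_step(x, u):
    return x + 0.1 * u      # stand-in for the prior simulator


def residual_mean(x, u):
    return -0.05 * x        # stand-in for the learned correction


def is_valid(x):
    return abs(x) <= 1.0    # states the prior model can represent


if __name__ == "__main__":
    print(rollout(0.0, lambda x: 1.0, prior_step, residual_mean, is_valid, 3))
    print(rollout(0.0, lambda x: 20.0, prior_step, residual_mean, is_valid, 5))
```

Truncating the rollout is only one possible way to handle an invalid state; how the return of a truncated rollout should be scored is left open in this sketch.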
Finally, BlackDROPS with GPMI brings trial-and-error and diagnosis-based approaches for robot damage recovery closer together. It combines (a) diagnosis [50] (i.e., identifying the likeliest robot model from data), (b) prior knowledge of the possible damages or different conditions that a robot may face, and (c) trial-and-error learning.
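As a rough illustration of the diagnosis step, i.e., identifying the likeliest model among a discrete set of prior parameterizations, the Python sketch below picks the candidate whose one-step predictions best match the observed transitions. The selection criterion (mean squared error), the scalar state, and the candidate names are assumptions made for this example; the criterion used by GPMI may differ.

```python
def identify_model(candidates, transitions):
    """Return the name of the candidate with the lowest mean squared one-step error.

    candidates: dict mapping a name to a function step(x, u) -> predicted next state
    transitions: list of observed (x, u, x_next) tuples
    """
    if not transitions:
        raise ValueError("need at least one observed transition")

    def mse(step):
        total = sum((step(x, u) - x_next) ** 2 for x, u, x_next in transitions)
        return total / len(transitions)

    return min(candidates, key=lambda name: mse(candidates[name]))


# Hypothetical candidates: the damaged model responds half as strongly to actions.
CANDIDATES = {
    "intact": lambda x, u: x + 0.10 * u,
    "leg_removed": lambda x, u: x + 0.05 * u,
}

if __name__ == "__main__":
    # Toy data generated from the "leg_removed" dynamics.
    data = [(0.0, 1.0, 0.05), (0.05, 2.0, 0.15), (0.15, -1.0, 0.10)]
    print(identify_model(CANDIDATES, data))
```

With a small discrete set, as in the hexapod case with 7 parameterizations, exhaustive comparison of this kind is cheap; a continuous parameter space would call for an optimizer instead.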
Appendix
Code for replicating the experiments: https://github.com/resibots/blackdrops.
Acknowledgments
The authors would like to thank Dorian Goepp, Rituraj Kaushik, Jonathan Spitz, and Vassilis Vassiliades for their feedback.
References
 [1] C. Atkeson et al., “No falls, no resets: Reliable humanoid behavior in the DARPA robotics challenge,” in Proc. of Humanoids, 2015.
 [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [3] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, “Imagenet: A largescale hierarchical image database,” in CVPR, 2009.
 [4] V. Mnih et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [5] M. P. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1, pp. 1–142, 2013.
 [6] A. S. Polydoros and L. Nalpantidis, “Survey of modelbased reinforcement learning: Applications on robotics,” Journal of Intelligent & Robotic Systems, pp. 1–21, 2017.
 [7] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes for dataefficient learning in robotics and control,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 408–423, 2015.
 [8] K. Chatzilygeroudis, R. Rama, R. Kaushik, D. Goepp, V. Vassiliades, and J.B. Mouret, “BlackBox Dataefficient Policy Search for Robotics,” in Proc. of IROS, 2017.
 [9] E. Keogh and A. Mueen, “Curse of dimensionality,” in Encyclopedia of Machine Learning. Springer, 2011, pp. 257–258.
 [10] M. Cutler and J. P. How, “Efficient reinforcement learning for robots using informative simulated priors,” in Proc. of ICRA, 2015.
 [11] G. Lee, S. S. Srinivasa, and M. T. Mason, “GPILQG: Datadriven Robust Optimal Control for Uncertain Nonlinear Dynamical Systems,” arXiv preprint arXiv:1705.05344, 2017.
 [12] M. Saveriano, Y. Yin, P. Falco, and D. Lee, “DataEfficient Control Policy Search using Residual Dynamics Learning,” in Proc. of IROS, 2017.
 [13] B. Bischoff, D. NguyenTuong, H. van Hoof, A. McHutchon, C. E. Rasmussen, A. Knoll, J. Peters, and M. P. Deisenroth, “Policy search for learning robot control using sparse data,” in Proc. of ICRA, 2014.
 [14] A. Cully, J. Clune, D. Tarapore, and J.B. Mouret, “Robots that can adapt like animals,” Nature, vol. 521, no. 7553, pp. 503–507, 2015.
 [15] A. Marco, F. Berkenkamp, P. Hennig, A. P. Schoellig, A. Krause, S. Schaal, and S. Trimpe, “Virtual vs. Real: Trading Off Simulations and Physical Experiments in Reinforcement Learning with Bayesian Optimization,” in Proc. of ICRA, 2017.
 [16] D. NguyenTuong and J. Peters, “Using model knowledge for learning inverse dynamics,” in Proc. of ICRA, 2010.
 [17] R. Camoriano, S. Traversaro, L. Rosasco, G. Metta, and F. Nori, “Incremental semiparametric inverse dynamics learning,” in Proc. of ICRA, 2016.
 [18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [19] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” in Proc. of ICML, 2015.
 [20] J. Kober and J. Peters, “Policy search for motor primitives in robotics,” Machine Learning, vol. 84, pp. 171–203, 2011.
 [21] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral control approach to reinforcement learning,” JMLR, vol. 11, pp. 3137–3181, 2010.
 [22] D. Wierstra et al., “Natural evolution strategies,” JMLR, vol. 15, no. 1, pp. 949–980, 2014.
 [23] N. Hansen and A. Ostermeier, “Completely derandomized selfadaptation in evolution strategies,” Evolutionary computation, vol. 9, no. 2, pp. 159–195, 2001.
 [24] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 1998.
 [25] J. Kober and J. Peters, “Imitation and reinforcement learning,” IEEE Robotics & Automation Magazine, vol. 17, no. 2, pp. 55–62, 2010.
 [26] F. Stulp and O. Sigaud, “Robot skill learning: From reinforcement learning to evolution strategies,” Paladyn, Journal of Behavioral Robotics, vol. 4, no. 1, pp. 49–61, 2013.
 [27] A. Cully and J.B. Mouret, “Behavioral repertoire learning in robotics,” in GECCO. ACM, 2013.
 [28] A. Majumdar and R. Tedrake, “Funnel libraries for realtime robust feedback motion planning,” IJRR, vol. 36, no. 8, pp. 947–982, 2017.
 [29] R. Antonova, A. Rai, and C. G. Atkeson, “Sample efficient optimization for learning controllers for bipedal locomotion,” in Proc. of Humanoids, 2016.
 [30] J.B. Mouret and J. Clune, “Illuminating search spaces by mapping elites,” arxiv:1504.04909, 2015.
 [31] V. Vassiliades, K. Chatzilygeroudis, and J.B. Mouret, “Using centroidal voronoi tessellations to scale up the multidimensional archive of phenotypic elites algorithm,” IEEE Trans. on Evolutionary Computation, 2017.
 [32] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proc. of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
 [33] J. Ko, D. J. Klein, D. Fox, and D. Haehnel, “Gaussian processes and reinforcement learning for identification and control of an autonomous blimp,” in Proc. of ICRA, 2007.
 [34] E. Todorov and W. Li, “A generalized iterative LQG method for locallyoptimal feedback control of constrained nonlinear stochastic systems,” in Proc. of ACC, 2005.
 [35] J. Hollerbach, W. Khalil, and M. Gautier, “Model identification,” in Springer Handbook of Robotics. Springer, 2016, pp. 113–138.
 [36] M. Gautier and W. Khalil, “Exciting trajectories for the identification of base inertial parameters of robots,” IJRR, vol. 11, no. 4, pp. 362–375, 1992.
 [37] F. Aghili, J. M. Hollerbach, and M. Buehler, “A modular and highprecision motion control system with an integrated motor,” IEEE/ASME Transactions on Mechatronics, vol. 12, no. 3, pp. 317–329, 2007.
 [38] C. Xie, S. Patil, T. Moldovan, S. Levine, and P. Abbeel, “Modelbased reinforcement learning with parametrized physical models and optimismdriven exploration,” in Proc. of ICRA, 2016.
 [39] J.B. Mouret and K. Chatzilygeroudis, “20 Years of Reality Gap: a few Thoughts about Simulators in Evolutionary Robotics,” in Workshop “Simulation in Evolutionary Robotics”, GECCO, 2017.
 [40] K. Chatzilygeroudis, V. Vassiliades, and J.B. Mouret, “Resetfree TrialandError Learning for Robot Damage Recovery,” arXiv:1610.04213, 2016.
 [41] Y. Engel, S. Mannor, and R. Meir, “Reinforcement learning with Gaussian processes,” in Proc. of ICML. ACM, 2005.
 [42] D. NguyenTuong and J. Peters, “Model learning for robot control: a survey,” Cognitive Processing, vol. 12, no. 4, pp. 319–340, 2011.
 [43] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. MIT Press, 2006.
 [44] M. Blum and M. A. Riedmiller, “Optimization of Gaussian process hyperparameters using Rprop,” in Proc. of ESANN, 2013.
 [45] A. Cully, K. Chatzilygeroudis, F. Allocati, and J.B. Mouret, “Limbo: A fast and flexible library for Bayesian optimization,” arxiv:1611.07343, 2016.
 [46] T. H. Rowan, “Functional stability analysis of numerical algorithms,” 1990.
 [47] S. G. Johnson, “The NLopt nonlinear-optimization package.”
 [48] A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann, “Modelbased contextual policy search for dataefficient generalization of robot skills,” Artificial Intelligence, 2014.
 [49] M. W. Spong and D. J. Block, “The pendubot: A mechatronic system for control research and education,” in Proc. of Decision and Control, 1995.
 [50] R. Isermann, Faultdiagnosis systems: an introduction from fault detection to fault tolerance. Springer Science & Business Media, 2006.