Sample Efficient Optimization for Learning Controllers
for Bipedal Locomotion
Abstract
Learning policies for bipedal locomotion can be difficult, as experiments are expensive and simulation does not usually transfer well to hardware. To counter this, we need algorithms that are sample efficient and inherently safe. Bayesian Optimization is a powerful sampleefficient tool for optimizing nonconvex blackbox functions. However, its performance can degrade in higher dimensions. We develop a distance metric for bipedal locomotion that enhances the sampleefficiency of Bayesian Optimization and use it to train a 16 dimensional neuromuscular model for planar walking. This distance metric reflects some basic gait features of healthy walking and helps us quickly eliminate a majority of unstable controllers. With our approach we can learn policies for walking in less than 100 trials for a range of challenging settings. In simulation, we show results on two different costs and on various terrains including rough ground and ramps, sloping upwards and downwards. We also perturb our models with unknown inertial disturbances analogous with differences between simulation and hardware. These results are promising, as they indicate that this method can potentially be used to learn control policies on hardware.
I Introduction
Designing and learning policies for bipedal locomotion is a challenging problem, as it is extremely expensive to do such experiments on an actual robot. We typically do not have robots that can take a fall, and it is cumbersome to perform these experiments. On top of this, most objective functions are nonconvex, nondifferentiable and noisy. With these considerations in mind, it is important to find optimization methods that are sample efficient, robust to noise and nonconvexity, and try to minimize the number of bad policies sampled. Bayesian Optimization is one such gradientfree blackbox global optimization method, that is sample efficient and robust to noise.
One common way of learning controllers is to come up with a control policy parameterizations and a cost function [4], [23], [6]. This cost is now minimised with respect to the policy parameters. In general, a variety of optimization approaches could be applied to this problem, for example gradient descent, evolutionary algorithms, and random search. Approaches like grid search, pure random search, and various evolutionary algorithms usually make the least restrictive assumptions, but are not sampleefficient. Gradientbased algorithms can be very effective when an analytical gradient is available or can be approximated effectively. However, random restarts are usually necessary to optimize nonconvex functions to avoid bad local optima, reducing sample efficiency. Bayesian optimization is well suited for such optimization problems as it constructs a global representation of the cost function, while only reducing uncertainty in promising regions of the search space.
In this work, we learn optimal reflex parameters for neuromuscular models described in [9]. These models are humaninspired control pathways that are capable of producing locomotion behaviour in a variety of scenarios as demonstrated in [18]. We start with a 2 dimensional 7link simulated robot with hip, knee and ankle actuation. We formulate a cost function which incorporates components like distance and time walked and optimize it over a set of 16 parameters of the neuromuscular model.
Promising results were reported in prior work for optimizing an 8dimensional control policy for a small bipedal robot using Bayesian Optimization [4]. However, our 16dimensional search space proved to be challenging for standard Bayesian Optimization, with performance not much better than uniform random search. Bayesian Optimization can take advantage of a domainspecific kernel, which gives an informed similarity between different policies. In this paper, we achieve this by developing a Determinants of Gait (DoG) kernel for the domain of bipedal ocomotion. Our kernel uses gait characteristics described in [17] to create an appropriate similarity metric between different parameter sets of the neuromuscular model. Under this metric, policies that generate successful walking gaits are closer together and policies that result in a fall are more distant from successful ones. This helps the optimization to effectively separate good regions of the parameter space from bad regions, making it more sample efficient.
We demonstrate that our kernel can substantially reduce the number of function evaluations (or robot trials) needed for optimization on different terrains with modeling disturbances. This leads us to believe that our kernel helps improve sample efficiency in conditions different from those in which it was generated. Potentially, we can generate this kernel in simulation and use it for optimization on the actual robots.
Ii Background
Iia Overview of Bayesian Optimization
Bayesian Optimization is a framework for sequential global search to find a vector that minimizes a cost function , while evaluating as few times as possible ([3] give an overview).
The optimization starts with initializing a prior to capture uncertainty over the value of for each in the domain. At iteration an auxiliary function , called an acquisition function, is used to sequentially select the next parameter vector to test, . is then evaluated by doing an experiment, and used to update our estimate of . The aim of the acquisition function is to achieve an effective tradeoff between exploration and exploitation. The use of an acquisition function is illustrated in Figure 1 (1D example).
A common way to model the prior and posterior for is by using a Gaussian Process , with mean function and kernel . The mean of the prior can be set to if no relevant domainspecific information is available. The kernel encodes how similar is expected to be for two inputs : points close together are expected to influence each other strongly, while points far apart would have almost no influence. The most widely used kernel is Squared Exponential kernel of the form .
IiB Optimization for Bipedal Locomotion
Bayesian Optimization (BO) with Gaussian Processes and closely related methods have been recently applied to several robotics domains. Krause et al. [12] developed an approach utilizing Gaussian Processes and the principle of optimizing mutual information for solving sensor placement problems. MartinezCantin et al. [14] used BO for online path planning for optimal sensing with a mobile robot. Lizotte et al. [13] used a closely related approach of Gaussian Process Regression to optimize the gait on a quadruped robot and showed that this approach required substantially fewer evaluations than stateoftheart local gradient approaches.
More specific to the domain of bipedal locomotion, Calandra et al. used BO to efficiently find gait parameters that optimize a desired cost [4]. They optimized eight parameters  four threshold values of a finite state machine of a walking controller and four control signals applied during extension and flexion of knees and hips  for a small biped.
While these previous results are encouraging, it is not immediately clear whether BO would be as successful in finding good policies for higherdimensional controllers. Calandra et al. mentioned that only around 1% of the parameter space they considered led to walking gaits, and we have observed similar difficulties in our experiments in 16 dimensions. Hence two questions arise : would BO be effective if the dimensionality is increased from 8 to 16? And if it does, how does it compare to previously used approaches, like CMAES [18]?
Iii Review of Neuromuscular models
We use neuromuscular model policies, as introduced in [9], as our controller for a 7link planar humanlike model. These policies use approximate models of muscle dynamics and humaninspired reflex pathways to generate joint torques, producing gaits that are similar to human walking in stance. [7] designed reflex laws for swing that enabled target footplacement and leg clearance, by analyzing the double pendulum dynamics of the human leg. Integrating this swing control with the previous reflex control enables the model to overcome disturbances in the range of up to cm [18].
Iii1 Neuromuscular Stance Control
In stance, each leg is actuated by 7 Hilltype muscles [15], consisting of the soleus (SOL), gastrocnemius (GAS), vastus (VAS), hamstring (HAM), tibialis anterior (TA), hip flexors (HFL) and gluteus (GLU), illustrated in Figure 3. Together, these muscles produce torques about the hip, knee and ankle. The muscle force is a nonlinear function of the muscle state and stimulus , which when multiplied by the moment arm gives the resultant torque on joint :
where is the torque applied by muscle on joint and is the joint angle.
Most of the muscle reflexes in stance are positive length or force feedbacks on the muscle stimulus. In general, the stimulus for muscle is a function of the time delayed length or force signal times a feedback gain :
where is the prestimulus, is the feedback gain and is the timedelayed feedback signal of length or force. Some muscles can be coactivated and have multiple feedback signals from more than one muscle. The feedback gains described above are a subset of the parameters that we aim to tune in our optimization. The details of these feedback pathways can be found in [18].
This feedback structure generates compliant leg behaviour and prevents the knee from overextending in stance. To balance the trunk, feedback on the torso angle is added to the GLU stimulus:
where is the position gain on the torso angle and is the desired angle. is the velocity gain and is the angular velocity. Specifically, here are the stance parameters we optimize over, and their roles in the neuromuscular model:

: Positive force feedback gain on GAS

: Positive force feedback gain on GLU

: Positive force feedback gain on HAM

: Positive force feedback gain on SOL

: Negative force feedback from SOL on TA

: Positive length feedback on TA

: Positive force feedback on VAS

: Position gain on feedback on torso angle

: Velocity gain on feedback on torso velocity

: Gain for mixing force feedback and feedback on angle for GLU
Iii2 Swing Leg Placement Control
The swing control is controlled by three main components – target leg angle, leg clearance and hip control. Target leg angle is a direct result of the foot placement strategy which is a function of the velocity of the center of mass (CoM) , and the as distance between the stance leg the CoM, and presented in [23]:
where is the target leg angle, is the nominal leg angle, , and are parameters optimized by our control.
Leg clearance is a function of the desired leg retraction during swing. The knee is actively flexed until the leg reaches the desired leg clearance height, and then held at this height, until the leg reaches a threshold leg angle. At this point, the knee is extended and allowed to reach the target leg angle . Details of this control can be found in [7]. As was noted in [18], and observed in our experiments, the control is relatively insensitive to the individual gains of the setup in swing. It is sufficient to control the higher level parameters such as the desired leg clearance and target leg angle.
The third part of the control involves maintaining the desired leg angle by applying a hip torque :
where is the position gain on the leg angle, is the velocity gain, is the leg angle and is the leg angular velocity (see Figure 3).
More concisely, the swing parameters that we focus on in our optimization are the following:

: Position gain on feedback on leg angle

: Velocity gain on feedback on leg velocity

: Nominal leg angle

: Gain on the horizontal distance between the stance foot and CoM

: Gain on the horizontal velocity of the CoM

: Desired leg clearance
Though originally developed for explaining human neural control pathways, these controllers have recently been applied to robots and prosthetics, for example in [19] and [20]. As demonstrated in [18], these models are indeed capable of generating a variety of locomotion behaviours for a humanoid model  for example, walking on flat, rough ground, turning, running, walking upstairs and on ramps. However, a full study of using these models to control biped robots still needs to be done. Whether these models will transfer well to robots with significantly different dynamics and inertial properties than humans needs to be explored. It is difficult to transfer these models to robots because of a large number of interdependent gains that need to be tuned. Typically, this is done using Covariance Matrix Adaptation Evolutionary Strategy (CMAES) [10], an evolutionary algorithm for difficult nonlinear nonconvex blackbox optimization problems. Even though CMAES is useful for optimizing nonconvex problems in high dimensions, it is not sample efficient and depends on the initial starting point. An optimization for 16 neuromuscular parameters takes 400 generations, around a day on a standard i7 processor and about 5,000 trials, as reported in [18].
The large number of trials make it impossible to implement CMAES on a real robot. This is a shortcoming because often we find that after training the policies in simulation, they do not transfer well to the real robot, due to differences between simulation and real hardware.
Iv Determinants of Gait Kernel
Iva Kernels for Sequential Decision Making
As described in section IIA, the kernel captures how similar the cost function is expected to be for parameter vectors and , and in the case of using Squared Exponential kernel, the similarity is a function of the Euclidean distance between .
A more informed alternative is to use a kernel that specifically leverages the structure of the problem at hand, for example the resulting trajectories or behavior. Intuitively, a kernel that can better encode similarity among policies will be more sampleefficient, since it will be better able to generalize across policies with similar performance. The BehaviorBased Kernel (BBK) of [21] is one kernel that leverages structure in the trajectories generated by the evaluated policies. To determine the similarity between policies given by vectors , BBK uses the similarity of the trajectory distributions induced by the corresponding policies (instead of the default approach of utilizing Euclidean distance ). While this could help in settings where computing a trajectory for a set of policy parameters is inexpensive, in the setting of robotic locomotion this requires running a simulation or executing the policy on the real hardware. This amounts to being as expensive as a cost function evaluation which makes BBK infeasible for such problems. Nonetheless, the idea of using auxiliary information in the kernel is promising, if this information can be precomputed for a large portion of the policy space and made available during online optimization. We describe our approach for constructing such a kernel in the next section. Our kernel effectively incorporates domain knowledge available in bipedal locomotion and eliminates the need for computing full trajectories during online optimization. Instead, it uses behavior information from only a short part of the trajectories pregenerated during an offline phase.
IvB Determinants of Gait Kernel
Bipedal walking can be characterized with some basic metrics, called gait determinants, as described in [17]. The six determinants of gait deal with the conservation of energy and maintaining forward momentum during human walking. For developing our kernel, we focussed on the knee flexion in swing, ankle movement and center of mass trajectory.
To compute the gait determinants of a given set of parameters, we run a short simulation for 5 seconds (as compared to 100 seconds for a complete trial). Next, we compute the score of the parameters on the following metrics :

Is the knee flexed in swing?
(1) Here, is the knee joint angle in swing, and are the high and low thresholds on knee angle, similar to human data as described in [22].

Is there heelstrike and toeoff?
(2) is the ankle joint angle at heelstrike (start of stance) and is the ankle joint angle at takeoff (end of stance). The two conditions ensure heelstrike and toeoff respectively.

Is the center of mass movement approximately oscillatory between steps?
(3) , and are center of mass heights at heel strike, mid stance and takeoff.

Is the torso leaning forward?
(4) is the mean torso angle which should be leaning forward for a energy efficient forward movement.

Deviation from average human walking speed
(5) and are the average simulator speed and the average human walking speed, [22].
are binary and is continuous per step. These are then summed up to form the total score for step :
(6) 
The final metric is then computed as a sum of scores over all the steps:
(7) 
where is the total number of steps in the first 5 seconds of simulation.
With this, a 16D point in the original parameter space now corresponds to a 1D point in this new feature space and we obtain our Determinants of Gait kernel:
is a very coarse measurement of the chances of the policy induced by resulting in stable walking movements over longer simulation periods. More importantly, points that lead to obviously unstable movements obtain a similar score of near zero, and are therefore grouped together. This kernel has no explicit information of the specific cost we are trying to optimize. It can very easily be used across multiple costs for walking behaviours, over slightly disturbed models as well as across multiple optimization methods.
V Experiments
In this section, we describe our experiments with the DoG kernel on two cost functions. First, we introduce our experimental settings and then move on to the results.
Va Details of Experimental Setup: Cost Function and Algorithms Compared
To ensure that our approach can perform well across various cost functions, we conduct experiments on two different costs, constructed such that parameter sets achieving low cost also achieve stable and robust walking gaits. The first cost function varies smoothly over the parameter space:
(8) 
where is seconds walked, is the final hip position, is mean speed and is the desired walking speed (from human data). This cost encourages walking further and for longer through the first two terms, and penalizes deviating from the target speed with the last.
The second cost function is a slightly modified version of the cost used in [18] for experiments with CMAES. It penalizes policies that lead to falls in a nonsmooth manner:
(9) 
Here is the distance travelled before falling, is the average speed in simulation, is the target speed and is the cost of transport. The first term directly penalizes policies that result in a fall, inversely to the distance walked. If the model walks for the simulation time, the cost is lower, ensured by the constants, and encourages policies that result in lower cost of transport and walk at target velocity. Since we have the same set of gains for left and right legs, the steadiness cost of the original cost [18] was unimportant.
In the following sections we compare the performance of several baseline and stateoftheart optimization algorithms in simulation. Motivated by the discussion in [5], we include the baseline of uniform random search. While this search is uninformed and not sampleefficient, it could (perhaps surprisingly) serve as a competitive baseline in nonconvex highdimensional optimization problems. We also provide comparisons with CMAES [10] and Bayesian Optimization with a Euclidean kernel (basic BO). Since we were optimizing a nonconvex function in a 16D space, it was not feasible to calculate the global minimum exactly. To estimate the global minimum for the costs we used, we ran CMAES (until convergence) and BO with our domain kernel (for 100 trials) for 50 runs without model disturbances on flat ground. When reporting results, we plot the best results found in this easier setting as the estimated optimum for comparison.
VB Model Disturbances
Most real robots have poor dynamic models, as well as unmodeled disturbances, like friction, nonrigid dynamics, etc which make simulations a poor representation of the real robot. There has been a lot of work done in identifying dynamic models of robots reliably, for example in [1]. However, while such methods can definitely help bring simulators close to the real robot, there are other discrepancies like nonrigid dynamics and friction which are still very hard to model. As a result, often controllers that work well in simulation lead to poor performance on the real robot. In such cases, ideally, we would like to have optimization techniques that quickly adapt to this slightly different setting and find a new solution in a few cost function evaluations.
To test if our approach is capable of generalizing to unforeseen disturbances, modeling and environmental perturbations, we conduct our experiments on models with mass and inertial disturbances and on different ground profiles. We perturb the mass of each link, inertia and center of mass location randomly by up to of the original value. For mass/inertia we randomly pick a variable from a uniform distribution between , where is the original mass/inertia of the segment. Similarly we change the location of the center of mass by , where is the length of the link. These disturbances are different for each run of our algorithm, hence we test a wide range of possible modelling disturbances. For the ground profiles, we generate random ground height disturbances of upto per step.
VC Experiments with DoG Kernel
We precompute Determinants of Gait (DoG) kernel scores for 100,000 parameter sets, which takes 710 hours on a modern desktop to speed up our computations. These samples are generated using a Sobol sequence [2] on an undisturbed model on flat ground. Thereafter, the same kernel is used for all the experiments described below.
In experiments with Bayesian Optimization, we were directly able to replace the Euclidean distances of a squared exponential kernel with the distance in the DoG kernel space:
(10) 
with as described in Section IVB. We used a Matlab implementation of Bayesian Optimization from [8] and used a presampled grid when considering next candidates for optimization. This allowed us to reuse our precomputed scores to speed up kernel computations, but restricted us to only use these presampled parameter sets. This can be harmful if an optimal set was not sampled; we sampled a dense grid to decrease the probability of this happening.
Our DoG scores were obtained from an unperturbed model of our system on flat ground. Our experimental results, however, were obtained on settings with different ground profiles and model disturbances (as discussed in Section VB). These perturbed settings were designed such that originally optimal set of policy parameters would likely become suboptimal. This is illustrated in the top and middle rows of Figure 4, where the policy performing well on flat ground falls on rough ground. This shows that our perturbations were indeed significant. After using the kernel for the optimization in these perturbed settings, we observed that best policies found were able to walk on rough ground (the lower part of Figure 4). This suggests that our kernel can be used to find optimum in settings significantly different than those it was created on.
All the experiments described below are done for 50 independent runs, each with a unique set of modeling disturbances and a different ground profile for rough ground walking. Each run consists of 100 trials or cost function evaluations, in which the optimization algorithm evaluates a parameter set for 100 seconds of simulation. Note that the disturbances and ground profiles remain constant across each run (and 100 trials).
VC1 Experiments on the smooth cost function
Figure 5 shows results of our experiments using the DoG kernel on the smooth cost. For BO with DoG kernel, 2530 cost function evaluations were sufficient to find points that corresponded to robot model walking on a randomly generated rough ground with disturbance. This is in contrast to basic BO that did not find such results in under 100 trials.
To let CMAES also benefit from the kernel, we started each run from one of the best 100 points for the DoG kernel. After tuning the parameter of CMAES to make it exploit more around the starting point, we were able to find policies that resulted in walking on rough ground after 6570 cost function evaluations on most runs. On the other hand, CMAES starting from a random initial point was not able to find walking policies in 100 evaluations.
These results suggest that DoG scores successfully captured useful information about the parameter space and were able to effectively focus BO and CMAES on the promising regions of the policy search space.
VC2 Experiments with the nonsmooth cost
We observed good performance on the nonsmooth cost function (Figure 6), though it was not as remarkable as the smooth cost. BO with kernel still outperformed all other methods by a margin, but this different cost seems to hurt BO and CMAES alike. Since this cost is discontinuous, there is a huge discrepancy between costs for parameters that walk and those that don’t. If no walking policies are sampled, BO learns little about the domain and samples randomly, which makes it difficult to find good parameters. Hence not all runs find a walking solution. BO was able to find successful walking in of cases on rough ground with disturbance in less than 60 trials/evaluations. CMAES starting from a good kernel point was able to do it in of runs.
This showed that our kernel was indeed independent of the cost function to an extent, and worked well on two very different costs. We believe that the slightly worse performance on the second cost is because of the cost structure, rather than a kernel limitation, as it still finds walking solutions for a significant portion of runs.
VC3 Experiments on different terrains
We also optimized on ramps – sloping upwards, as well as downwards. The ramp up and down ground slopes were gradually increased every , until the maximum slope was reached. The maximum slopes for going down and going up were (). BO with DoG kernel was able to find parameters that walked for 100 seconds in of cases in ramp up and in ramp down. Example optimized policies walking up and down slope are shown in Figure 7.
We believe the reason we could not find walking policies on ramps in all runs, was that we are not optimizing the hip lean, which was noted to be crucial for this profile in [18]. Since we did not consider this variable when generating our 16 dimensional kernel, it was not trivial to optimize over it without regenerating the grid. Similarly, we found that we could not find any policies that climbed up stairs. Perhaps this could be achieved when optimizing over a much larger set of parameters, as in [18].
In the future, we would like to include more variables for optimizing over different terrains, and include them as part of the kernel.
Vi Conclusions
In this work we focused on sampleefficiently finding walking policies for a bipedal neuromuscular model. This high dimensional optimization problem proved challenging for standard Bayesian Optimization. So, we introduced the Determinants of Gait (DoG) metric and constructed the corresponding DoG kernel to effectively incorporate domain knowledge into the kernel. For our experiments we precomputed the kernel on flat ground with an unperturbed model, and then tested in more challenging settings. We demonstrated that our approach offers improved performance for learning walking patterns on different ground profiles, like rough ground, ramp up and ramp down, all with various unknown inertial disturbances to the original model.
Our results motivated us to consider several directions of future work. One of the next steps would be to experiment with learning more parameters of the neuromuscular model. Adjusting more parameters would allow us to finetune walking behaviors for more challenging settings like stairs and steeper ramps. This would also make the problem more challenging because of the increase in the dimensionality of the search space. An informed kernel could provide robust performance by simplifying the search. We also would like to experiment with different ways of computing final DoG scores, perhaps by constructing a kdimensional vector of individual metrics instead of collapsing them into a scalar score. And most importantly, we want to experiment with our approach on real hardware. We developed our experimental setup with future hardware experiments in mind, so we hope our approach would offer the needed sample efficiency to enable learning control policy parameters on the real hardware efficiently and adaptively.
References
 [1] An, Chae H. and Atkeson, Christopher G. and Hollerbach, John M. Modelbased Control of a Robot Manipulator. MIT Press, Cambridge, MA, USA, 1988.
 [2] Paul Bratley and Bennett L Fox. Algorithm 659: Implementing Sobol’s Quasirandom Sequence Generator. ACM Transactions on Mathematical Software (TOMS), 14(1):88–100, 1988.
 [3] Eric Brochu, Vlad M Cora, and Nando De Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv preprint arXiv:1012.2599, 2010.
 [4] Roberto Calandra, Nakul Gopalan, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian Gait Optimization for Bipedal Locomotion. In Learning and Intelligent Optimization, pages 274–290. Springer, 2014.
 [5] Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian Optimization for Learning Gaits Under Uncertainty. Annals of Mathematics and Artificial Intelligence, 76(12):5–23, 2016.
 [6] Stelian Coros, Philippe Beaudoin, and Michiel van de Panne. Generalized Biped Walking Control. In ACM Transactions on Graphics (TOG), volume 29, page 130. ACM, 2010.
 [7] Ruta Desai and Hartmut Geyer. Robust Swing Leg Placement Under Large Disturbances. In ROBIO 2012, pages 265–270. IEEE, 2012.
 [8] Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John Cunningham. Bayesian Optimization with Inequality Constraints. In ICML, pages 937–945, 2014.
 [9] Hartmut Geyer and Hugh Herr. A Musclereflex Model that Encodes Principles of Legged Mechanics Produces Human Walking Dynamics and Muscle Activities. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 18(3):263–273, 2010.
 [10] Nikolaus Hansen. The CMA Evolution Strategy: A Comparing Review. In Towards a New Evolutionary Computation, pages 75–102. Springer, 2006.
 [11] Scott Clark (Yelp Inc). Introducing MOE: Metric Optimization Engine. http://engineeringblog.yelp.com/2014/07/introducingmoemetricoptimizationengineanewopensourcemachinelearningserviceforoptimalex.html.
 [12] Andreas Krause, Ajit Singh, and Carlos Guestrin. Nearoptimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. The Journal of Machine Learning Research, 9:235–284, 2008.
 [13] Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic Gait Optimization with Gaussian Process Regression. In IJCAI, volume 7, pages 944–949, 2007.
 [14] Ruben MartinezCantin, Nando de Freitas, Eric Brochu, José Castellanos, and Arnaud Doucet. A Bayesian ExplorationExploitation Approach for Optimal Online Sensing and Planning with a Visually Guided Mobile Robot. Autonomous Robots, 27(2):93–103, 2009.
 [15] JB Morrison. The Mechanics of Muscle Function in Locomotion. Journal of Biomechanics, 3(4):431–451, 1970.
 [16] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
 [17] J. B. dec. M. Saunders, Verne T. Inman, and Howard D. Eberhart. The Major Determinants in Normal and Pathological Gait. The Journal of Bone & Joint Surgery, 35(3):543–558, 1953.
 [18] Seungmoon Song and Hartmut Geyer. A Neural Circuitry that Emphasizes Spinal Feedback Generates Diverse Behaviours of Human Locomotion. The Journal of Physiology, 593(16):3493–3511, 2015.
 [19] Nitish Thatte and Hartmut Geyer. Toward Balance Recovery with Leg Prostheses Using Neuromuscular Model Control. IEEE Transactions on Biomedical Engineering, 63(5):904–913, 2016.
 [20] Nicolas Van der Noot, Auke J Ijspeert, and Renaud Ronsse. Biped Gait Controller for Large Speed Variations, Combining Reflexes and a Central Pattern Generator in a Neuromuscular Model. In ICRA 2015, pages 6267–6274. IEEE, 2015.
 [21] Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using Trajectory Data to Improve Bayesian Optimization for Reinforcement Learning. The Journal of Machine Learning Research, 15(1):253–282, 2014.
 [22] DA Winter and HJ Yack. EMG Profiles During Normal Human Walking: StridetoStride and Intersubject Variability. Electroencephalography and Clinical Neurophysiology, 67(5):402–411, 1987.
 [23] KangKang Yin, Kevin Loken, and Michiel van de Panne. Simbicon: Simple Biped Locomotion Control. In ACM Transactions on Graphics (TOG), volume 26, page 105. ACM, 2007.