Iterative Reinforcement Learning Based Design of
Dynamic Locomotion Skills for Cassie
Abstract
Deep reinforcement learning (DRL) is a promising approach for developing legged locomotion skills. However, the iterative design process that is inevitable in practice is poorly supported by the default methodology: it is difficult to predict the outcomes of changes made to the reward functions, policy architectures, and the set of tasks being trained on. In this paper, we propose a practical method that allows the reward function to be fully redefined on each successive design iteration while limiting the deviation from the previous iteration. We characterize policies via sets of Deterministic Action Stochastic State (DASS) tuples, which represent the deterministic policy state-action pairs as sampled from the states visited by the trained stochastic policy. New policies are trained using a policy gradient algorithm that mixes RL-based policy gradients with gradient updates defined by the DASS tuples. The tuples also allow for robust policy distillation to new network architectures. We demonstrate the effectiveness of this iterative-design approach on the bipedal robot Cassie, achieving stable walking with different gait styles at various speeds. We demonstrate the successful transfer of policies learned in simulation to the physical robot without any dynamics randomization, and show that variable-speed walking policies for the physical robot can be represented by a small dataset of 5-10k tuples.
I. Introduction
Recent success in deep reinforcement learning (DRL) has inspired much work towards constructing locomotion policies for legged robots. Impressive results have been demonstrated on planar bipeds [19], quadruped robots [35, 14], and six-legged robots [5]. However, these systems are relatively stable in comparison to human-scale bipeds, for which convincing demonstrations of DRL methods for dynamic locomotion on real hardware are still lacking, to the best of our knowledge. Nevertheless, a variety of results in simulation point to the promise of DRL methods in this area, e.g., [43, 24, 41, 8]. In practice, a multitude of issues can preclude the successful transfer of policies from simulation to the physical robot, including state estimation, modeling discrepancies, and motions that can cause excessive wear on hardware.
In this paper, we propose a DRL design process that reflects and supports the iterative nature of control policy design. At its heart is a data collection technique that allows us to recover a trained policy from a relatively small number of samples. With this technique, we can quickly compress and combine locomotion policies with supervised learning. By using policy gradient updates that combine the supervised learning samples and conventional DRL policy-gradient samples, we allow for the iterative design of improved policies using new reward functions that encourage desired behaviors. We validate our approach in simulation and on a physical Cassie robot, demonstrating stable walking policies with different styles at various speeds. Frames from a learned forward-walking gait for Cassie are shown in Fig. 1.
To summarize, this paper makes the following contributions:

We present a simple-yet-effective technique to reconstruct policies from only a small number of samples, and show that robust variable-speed walking policies can be achieved on physical hardware using datasets of 5-10k tuples taken from simulations.

We combine reinforcement learning with supervised learning from this small number of samples. Guided by these samples, new policies can be learned by designing new reward functions that define desired changes to the behaviors while staying close to the original policy. This offers a strong alternative to "fine-tuning" approaches, where an existing policy may be adapted via small changes and additions to an existing reward function, but which results in ever-more cumbersome reward functions and may exhibit unexpected changes in behavior.

We apply this approach to train various locomotion policies on a simulated model of the bipedal robot Cassie. These policies are successfully transferred to the physical robot without any further tuning or adaptation.
To the best of our knowledge, this is the first time that neural network policies for variable-speed locomotion have been successfully deployed on a human-scale bipedal robot such as Cassie. The policies trained in simulation are directly transferred to the physical robot without the use of dynamics randomization methods. The gaits are comparable to or faster in speed than other gaits reported in the literature for Cassie, e.g., [7].
II. Related Work
II-A. Supervised Learning for Trajectory Optimization
There exists significant prior art on generating optimal trajectories using supervised learning. This can be made fast and robust by precomputing a library of solutions and using a “warm start” for new problems using nearest neighbors, e.g., Liu et al. [20], Tang and Hauser [36]. Regression using neural networks has been used to generate optimal trajectories for bipedal robots [6] and quadrotors [37]. Guided Policy Search [18] makes use of solutions from trajectory optimization to guide the policy search.
II-B. Imitation Learning
Imitation learning seeks to approximate an expert policy. In its simplest form, one can collect a sufficiently large sample of state-action pairs from the expert and apply supervised learning; this is also referred to as behavior cloning, as used in early seminal autonomous driving work by Pomerleau [26]. However, due to issues of compounding errors and covariate shift, this method often leads to failure [28]. The DAGGER method [29] was proposed to solve this problem, in which the expert policy is iteratively queried to augment the expert dataset. Laskey et al. [17] injected adaptive noise into the expert policy to reduce expert queries. Recent work learns a linear approximation of the expert policy [21]. Another line of work formulates imitation learning problems as reinforcement learning problems by inferring the reward signal from expert demonstrations using methods such as GAIL [13]. Expert trajectories can also be stored in a reinforcement learning agent’s experience buffer to accelerate the reinforcement learning process [3, 4]. Dynamic Movement Primitives [31, 22] provide another approach to incorporating expert demonstrations to learn motor skills.
II-C. Distillation
Supervised learning is often used to combine multiple policies. For example, it has been used successfully to train policies that play multiple Atari games [30, 23]. More recently, Berseth et al. [2] use it to train a simulated 2D humanoid to traverse different types of terrain. These methods still suffer from the covariate shift problem and need to use DAGGER in the process.
II-D. Bipedal Locomotion
Bipedal locomotion skills are important for robots to be able to traverse terrains that are typical in human environments. Many methods use the Zero Moment Point (ZMP) to plan stable walking motions, e.g., [12, 39]. Low-dimensional models such as the linear inverted pendulum (LIP) and the spring-loaded inverted pendulum (SLIP) can be used to simplify the robot dynamics [15, 1, 42] for easier and faster planning. To utilize the full dynamics of the robots, offline trajectory optimization such as direct collocation [11] is often used to generate trajectories, and tracking controllers based on quadratic programming [27] or feedback linearization [7] can be designed along these trajectories.
Reinforcement learning has also been applied to bipedal locomotion; results on hardware have been demonstrated on either 2D bipeds [32, 19] or bipeds with large feet [38]. More recently, deep reinforcement learning has been applied to 3D bipedal locomotion problems [24, 43, 8]. However, these works have not yet shown results on a physical robot.
III. Preliminaries
In this section we briefly outline the reinforcement learning and imitation learning frameworks.
III-A. Reinforcement Learning
In reinforcement learning, we wish to learn an optimal policy for a Markov Decision Process (MDP). The MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and action space of the problem, and the transition function $P$ is a probability density function, with $P(s_{t+1} \mid s_t, a_t)$ being the probability density of $s_{t+1}$ given that, at state $s_t$, the system takes the action $a_t$. The reward function $r(s_t, a_t)$ gives a scalar reward for each transition of the system, and $\gamma \in (0, 1)$ is the discount factor. The goal of reinforcement learning is to find a policy $\pi_\theta$, parameterized by $\theta$, where $\pi_\theta(a \mid s)$ is the probability density of $a$ given $s$, that solves the following optimization problem:

$$\max_\theta \; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t)\right] \quad \text{subject to} \quad s_{t+1} \sim P(\cdot \mid s_t, a_t), \;\; a_t \sim \pi_\theta(\cdot \mid s_t).$$
Policy gradient algorithms [34] are a popular approach to solving this problem, where the gradient of the expected return with respect to $\theta$ is estimated using on-policy samples, i.e., using data collected from the current stochastic policy.
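As a concrete illustration of this setup, the sketch below uses a hypothetical Gaussian policy with a linear mean (not the paper's network) to compute discounted returns for one rollout and a score-function (REINFORCE-style) estimate of the policy gradient:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * r_{t+k}, computed backwards over one rollout."""
    returns, acc = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    return returns

def reinforce_gradient(states, actions, returns, theta, sigma=0.1):
    """Score-function gradient for a Gaussian policy with linear mean
    mu(s) = theta @ s:  grad_theta log pi(a|s) = (a - mu) s^T / sigma^2."""
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        grad += G * np.outer((a - theta @ s) / sigma**2, s)
    return grad / len(states)
```

In practice the paper estimates this gradient with Proximal Policy Optimization rather than plain REINFORCE, but the on-policy structure is the same.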
III-B. Imitation Learning
In imitation learning, we have an MDP as defined above, and an expert policy $\pi_E$ is given. The goal of imitation learning is to find a parameterized policy $\pi_\theta$ that minimizes the difference between $\pi_\theta$ and $\pi_E$. More formally, we aim to solve the following optimization problem:

$$\min_\theta \; \mathbb{E}_{s \sim d_{\pi_E}}\!\left[ D_{\mathrm{KL}}\!\left(\pi_E(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right) \right] \tag{1}$$

where $d_{\pi_E}(s)$ is the probability density of state $s$ when executing policy $\pi_E$.
The expectation in the objective is often estimated by collecting a dataset of expert demonstrations. In behavior cloning, the expert policy is assumed to be deterministic. This causes the well-known covariate shift problem, where the student policy accumulates errors over time and eventually drifts to states that were not seen by the expert during data collection. Popular remedies to this issue include DAgger [29] and DART [17], which query the expert policy iteratively to augment the dataset.
IV. Methods
In this section, we present our method for collecting state-action pairs as a dataset for imitation learning, and show how this dataset can be used to combine imitation learning and reinforcement learning. In our iterative-design framework, we consider a previously-learned policy as being the expert for the next iteration of policy optimization.
IV-A. Data Collection
If we assume $\pi_\theta$ and $\pi_E$ are Gaussian distributions with the same covariance, minimizing the imitation objective function (1) is equivalent to minimizing $\mathbb{E}_{s \sim d_{\pi_E}}\!\left[\lVert \mu_\theta(s) - \mu_E(s) \rVert^2\right]$, where $\mu_\theta$ and $\mu_E$ are the means of $\pi_\theta$ and $\pi_E$. It is generally impractical to calculate this expectation exactly; in practice, we collect an expert dataset $D = \{(s_i, \mu_E(s_i))\}_{i=1}^{N}$ of size $N$, where $s_i$ is a state visited by the expert during policy execution, and minimize the training error over $D$, i.e., we solve the following supervised learning problem:

$$\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} \lVert \mu_\theta(s_i) - \mu_E(s_i) \rVert^2 \tag{2}$$
Note that during data collection, while we record only the mean of the policy, we simulate a stochastic policy by adding noise to the mean during execution. This is related to [17], where adaptive noise is added to the expert policy to prevent covariate shift. In our setting, since we already know the distribution of our expert policy, we can avoid iteratively querying the expert policy with adaptive noise, and instead query the expert only once at the beginning. We refer to this data collection method as Deterministic Action Stochastic State (DASS), since we collect only deterministic actions, but at states that are sampled from the stochastic policy. Algorithm 1 summarizes our data collection procedure.
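A minimal sketch of the DASS collection loop follows; `policy_mean` and `env_step` are hypothetical stand-ins for the trained network and the simulator:

```python
import numpy as np

def collect_dass(policy_mean, env_step, s0, n_samples, sigma=0.1, seed=0):
    """DASS collection (a sketch of Algorithm 1): execute the stochastic
    policy so that visited states cover a tube around the limit cycle,
    but record the deterministic mean action at every visited state."""
    rng = np.random.default_rng(seed)
    dataset, s = [], np.asarray(s0, dtype=float)
    for _ in range(n_samples):
        mu = policy_mean(s)                              # deterministic action: recorded
        dataset.append((s.copy(), mu.copy()))
        a = mu + sigma * rng.standard_normal(mu.shape)   # noisy action: executed
        s = env_step(s, a)
    return dataset
```

Because the noise distribution of the expert is known, a single pass like this suffices; no iterative re-querying of the expert is needed.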
From a control perspective, this method for collecting expert data can be interpreted as follows. For policies such as walking that produce a limit cycle trajectory, if we record the actions of an expert with no noise, i.e., using just the deterministic mean actions, then the collected data covers only the limit cycle, and the student will not observe the feedback that should be applied when the state is outside of the limit cycle. With the noise of the stochastic policy, the expert is further able to provide data on how to return to the limit cycle from states not on the cycle, and the student is able to learn a feedback controller along this limit cycle. This is illustrated schematically in Fig. 2, where the blue curves represent the limit cycle produced by a deterministic policy, and the green arrows represent the deterministic feedback actions associated with the additional states resulting from the execution of the stochastic policy.
A key advantage of representing policies using DASS tuples is that they can be combined in order to distill multiple specialized policies into one single policy. If we assume the desired skill specification is implicit in the state information, we can collect datasets $D_1, \ldots, D_K$ corresponding to multiple experts $\pi_{E_1}, \ldots, \pi_{E_K}$, and use the union of these datasets for supervised learning.
IV-B. Combining Reinforcement Learning and Imitation Learning
Imitation learning with DASS provides a sample-efficient method to recover and combine expert policies. However, a realistic design process necessitates further iteration, where we wish to train policies that are further refined with respect to some criteria, while also remaining close to the original expert policies. To achieve this goal, we add a constraint to the original formulation of the reinforcement learning problem:

$$\max_\theta \; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t)\right] \quad \text{subject to} \quad \frac{1}{N} \sum_{i=1}^{N} \lVert \mu_\theta(s_i) - \mu_E(s_i) \rVert^2 \le \epsilon.$$
To make this problem easier, we relax the constraint to a soft constraint and rewrite the objective as $J_{RL}(\theta) - w \, L_D(\theta)$, where $J_{RL}$ is the expected return, $L_D$ is the supervised loss from Equation (2), and $w$ is a weighting coefficient. At each iteration, we estimate $\nabla_\theta J_{RL}$ using the usual policy gradient algorithm, and update $\theta$ according to $\theta \leftarrow \theta + \alpha \left(\nabla_\theta J_{RL} - w \, \nabla_\theta L_D\right)$.
Note that the reward function need not be related to the expert skill. If we set $r = 0$, then we recover the imitation learning problem. Furthermore, we can learn new skills without forgetting expert skills. For example, the expert can be a policy for a robot walking forward while the reward $r$ encourages the robot to walk backward. If one has reason to believe that the expert policy is suboptimal, we can also use this method to fine-tune the expert policy by defining $r$ to be the same reward used to train the expert policy. The benefit of this is that we do not need access to the expert policy itself for the fine-tuning to happen. Finally, we can design rewards so that the new policy satisfies additional specific objectives that we desire, such as smoother movement or lifting the feet higher at each step.
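The mixed update above can be sketched as follows for a hypothetical linear policy mean; `pg_grad` stands in for the policy-gradient estimate produced by the RL algorithm:

```python
import numpy as np

def combined_update(theta, pg_grad, dass_states, dass_actions, lr=0.25, w=1.0):
    """One parameter update under the soft-constraint objective (sketch,
    for an illustrative linear policy mean mu_theta(s) = theta @ s):
    ascend the RL policy gradient while descending the supervised DASS
    loss L_D = (1/N) sum_i ||mu_theta(s_i) - a_i||^2."""
    sup_grad = np.zeros_like(theta)
    for s, a in zip(dass_states, dass_actions):
        sup_grad += 2.0 * np.outer(theta @ s - a, s)   # d L_D / d theta
    sup_grad /= len(dass_states)
    return theta + lr * (pg_grad - w * sup_grad)
```

With `pg_grad = 0` this reduces to a pure imitation step toward the DASS tuples; with `w = 0` it reduces to an ordinary policy-gradient step.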
V. Experimental Setup
V-A. Robot Specification
We evaluate our methods using the Cassie bipedal robot. Cassie, shown in Fig. 3, is designed and built by Agility Robotics. It stands approximately 1 meter tall and has a total mass of 31 kg, with most of the weight concentrated in the pelvis. There are two leaf springs on each leg that make the legs more compliant. This introduces extra underactuation into the system and makes control design more difficult for traditional techniques.
V-B. Training Framework
We adopt the framework used in [41] for training several initial policies, where we reward the agent for producing motion that approximately reproduces a set of specified reference motions. The input state to the policy consists of the state of the robot, which evolves according to the robot’s dynamics, and the reference motion, which evolves deterministically according to the motion we desire to track. The state of the robot includes the height, orientation expressed as a unit quaternion, velocities, angular velocities, and acceleration of the pelvis, as well as the joint angles and joint velocities of each joint. Together these form the input vector to the policy. We use the commonly-adopted Gaussian policy as output, where the neural network outputs the mean of the policy and Gaussian noise is injected on top of the action during execution. The output and the reference motion are summed to produce target joint angles for a low-level PD controller. Instead of making the covariance of the Gaussian policy a learnable parameter, we use a fixed covariance for our policy, assuming that the Gaussian distribution in each dimension is independent, with a fixed standard deviation in radians. A benefit of the fixed covariance is that, because of the noise constantly injected into the system during training, the resulting policy adapts itself to handle unmodeled disturbances during testing, as demonstrated in previous work [41, 25]. The network architecture is shown in Fig. 3. The policy is trained with an actor-critic algorithm using a simulated model of Cassie in the MuJoCo simulator [40], with the gradient of the policy estimated using Proximal Policy Optimization [33]. The simulator includes a detailed model of the robot’s rigid-body dynamics, including the reflected inertia of the robot’s motors, as well as empirically measured noise and delay for the robot’s sensors and actuators.
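The action pipeline described above can be sketched as follows; the function names and noise scale are illustrative, not the paper's exact implementation:

```python
import numpy as np

def policy_step(network_mean, reference, s, sigma=0.1, deterministic=False, seed=0):
    """Fixed-covariance Gaussian policy (sketch): the network outputs the
    mean action, Gaussian noise with a fixed std is injected during
    execution, and the result is added to the reference motion to form
    target joint angles for the low-level PD controller."""
    rng = np.random.default_rng(seed)
    mu = network_mean(s)
    a = mu if deterministic else mu + sigma * rng.standard_normal(mu.shape)
    return reference(s) + a   # PD targets = reference joint angles + policy output
```

Setting `deterministic=True` corresponds to the mean actions that DASS records, while the noisy branch corresponds to the actions actually executed during training.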
V-C. Policy Training
The design process we use for training our policies is summarized in Fig. 4. We initially train four different tracking-based policies: stepping in place; walking forward over a range of speeds; walking backward; and fast walking forward. The reference motions for the stepping-in-place and walking-forward policies are recorded from motion produced by an existing heuristically-tuned controller, and the reference motions for walking at other speeds are obtained by scaling the translation and velocity of the forward-walking motion by the desired value. There exist numerous other choices for obtaining reference motions, including direct collocation [11] or keyframing, but these are beyond the scope of this paper. The reference motions we work with are symmetric and the robot itself is nearly symmetric, so it is natural to enforce symmetry in the policies as well. We adopt an approach similar to [10], where we transform the input and output every half walking cycle to their symmetric form. During training, we apply the reference state initialization and early termination techniques suggested by Peng et al. [25], where each rollout is started from states sampled from the reference motions and is terminated when the height of the pelvis falls below a fixed threshold, or whenever the reward for any given timestep falls below a threshold.
The initial tracking-based policies are then used as the starting point for further design exploration. We show a number of these policies running on the physical robot in our supplementary video. Several intermediate policies are also successfully tested on the robot, but are not shown due to video-duration constraints. At each level, all policies are trained from scratch instead of fine-tuning the previous policies. This is important for distilling multiple policies together, and for policy compression onto a smaller network; in these cases, an original policy will not be available. We further note that fine-tuning a policy based on a new reward function often results in undesired changes to the policy, as it can readily "forget" the objectives and motion features of the starting policy.
V-D. Hardware Tests
We deploy a selection of trained policies on a physical Cassie robot. The state of the robot is estimated using sensor measurements from an IMU and joint encoders, which are fed into an Extended Kalman Filter to estimate parts of the robot’s state, such as the pelvis velocity. This process runs at 2 kHz on an embedded computer directly connected to the robot’s hardware. This information is sent over a point-to-point Ethernet link to a secondary computer onboard the robot, which runs a standard Ubuntu operating system and executes the learned policy using the PyTorch framework. The policy updates its output joint PD targets once every 30 ms based on the latest state data and sends the targets back to the embedded computer over the Ethernet link. The embedded computer executes a PD control loop for each joint at the full 2 kHz rate, with targets updated every 30 ms as new data arrives from the policy.
Rapid deployment and testing are aided by the simulator using the same network-based interface as the physical robot, which means that tests can be moved from simulation to hardware by copying files to the robot’s onboard computer and connecting to a different address. The robot has a short homing procedure performed after powering on, and can be left on between tests of different policies. The same filtering and estimation code used on hardware is used internally in the simulator, rather than giving the policy direct access to the true simulation state. The network link between the two computers introduces an additional 1-2 ms of latency beyond running the simulator and policy on the same machine, and many of the robot’s body masses differ slightly from the simulated robot due to imprecisely modeled cabling and electronics and minor modifications made to the robot since the simulation parameters were produced.
VI. Policy Compression and Distillation
In this section, we present results for using DASS to compress and distill multiple policies. In the experiments, we update the student policies using Adam [16] with the supervised loss from Equation (2). We collect an additional set of DASS samples for evaluating validation error, and we stop training when the training error stops improving over a fixed number of iterations.
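A minimal sketch of this distillation procedure, using plain gradient descent on a linear student in place of Adam on a small MLP:

```python
import numpy as np

def distill(train_set, val_set, dim_s, dim_a, lr=0.05, tol=1e-8, patience=50):
    """Minimize the supervised loss of Equation (2) over DASS tuples for a
    linear student mu(s) = W @ s, stopping when the training error stops
    improving (illustrative; the paper uses Adam on a small network)."""
    W = np.zeros((dim_a, dim_s))
    best, stall = np.inf, 0
    while stall < patience:
        grad, loss = np.zeros_like(W), 0.0
        for s, a in train_set:
            err = W @ s - a
            loss += float(err @ err)
            grad += 2.0 * np.outer(err, s)
        loss /= len(train_set)
        W -= lr * grad / len(train_set)
        if loss < best - tol:
            best, stall = loss, 0
        else:
            stall += 1
    val_loss = np.mean([float((W @ s - a) @ (W @ s - a)) for s, a in val_set])
    return W, val_loss
```

The held-out `val_set` plays the role of the extra DASS samples used to monitor validation error.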
VI-A. Policy Compression
In deep reinforcement learning, network size often plays an important role in determining the end result [9]. It is further shown in [30] that for learning to play Atari games, a large network is necessary during RL training to achieve good performance, but that the final policy can be compressed using supervised learning without degrading performance. For our problem, we also observe that using a larger network for reinforcement learning improves learning efficiency and produces more robust policies, as shown in Fig. 5.
While we need a large network to do reinforcement learning efficiently, we find that we can compress the expert policy into a much smaller network while maintaining the robustness of the final policy. With only a small number of samples, we can recover a stepping-in-place policy with a small hidden layer. Fig. 6 compares a dataset collected using behavior cloning with one collected using the DASS strategy. Table I compares policies trained using supervised learning across varying choices of hidden layer sizes, numbers of training samples, and the presence or absence of noise during data collection. With so few samples, a large network can easily overfit the training data. We find that while larger networks can indeed have this issue, with validation error orders of magnitude larger than the training error, the resulting policy still performs comparably to the original policy in terms of robustness.
policy      | training loss | validation loss | no noise | 0.1 policy noise | 20% mass noise | 50 N pushes
------------|---------------|-----------------|----------|------------------|----------------|------------
expert      |               |                 |          |                  |                |
300 samples |               |                 |          |                  |                |
600 samples |               |                 |          |                  |                |
no noise    |               |                 |          |                  |                |
We successfully test the policy on the physical robot. It exhibits similar behavior as in simulation, with the robot stepping in place while supporting its own weight. However, the pelvis exhibits an undesirable shaky movement, both in simulation and on the physical robot, shown in the supplementary video. This corresponds to Policy A in Fig. 4.
VI-B. Policy Distillation
After training a network for a skill, we may want the policy to learn additional skills. In the context of the Cassie robot, we desire a control policy that can not only step in place, but also walk forward and backward on command. However, catastrophic forgetting can occur when trying to learn new skills. Distillation is one way to deal with forgetting: we can learn policies that master these skills separately and then distill them into a single policy with supervised learning. We distill the three expert policies that are trained separately for walking forward, stepping in place, and walking backward into one policy that masters all three tasks. With samples collected from each of these policies, we are able to combine them into a single policy with a small hidden layer.
VII. Iterative Design with Changing Rewards
VII-A. Stable Pelvis Movement
As noted in the previous section, the tracking-based policy results in an undesirable shaking of the robot body. While in simulation this does not affect the ability of the robot to complete its task, it places excessive wear on the physical robot hardware. We now apply the framework that combines policy-gradient RL updates and DASS-based policy cloning, in support of iterative policy design. The reward is dedicated to achieving stable movement of the robot body while maintaining the desired speed. Specifically, we set the reward to penalize the angular velocity of the pelvis, with an additional term that ensures the policy tracks the desired velocity.
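One plausible form for such a reward is sketched below; the exact terms and weights used in the experiments are not reproduced here, so treat this as illustrative only:

```python
import numpy as np

def stable_pelvis_reward(omega, v, v_des, w_stable=0.5, w_vel=0.5):
    """Hypothetical stable-pelvis reward: one term penalizes pelvis
    angular velocity, the other rewards tracking the commanded speed;
    both terms map to (0, 1]."""
    r_stable = np.exp(-float(np.dot(omega, omega)))  # -> 1 when the pelvis is still
    r_vel = np.exp(-(v - v_des) ** 2)                # -> 1 at the desired speed
    return w_stable * r_stable + w_vel * r_vel
```

Note that this reward says nothing about tracking the original reference motion; staying close to the previous behavior is handled entirely by the DASS samples in the mixed update.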
We first experiment with simply fine-tuning the previous policy using the new reward, with a fixed desired velocity. This results in the robot learning to stand still; while this is a perfectly usable policy for this particular reward, it has effectively forgotten how to step in place. We next test learning that incorporates DASS samples into the policy update. To balance the number of DASS samples and on-policy samples, on each iteration we train on 3000 DASS samples, always drawn from the same set and used via supervised learning, together with 3000 on-policy policy-gradient samples. The resulting policy produces the desired stepping-in-place motion with much smoother pelvis movement than the original tracking-based policy. Fig. 7 compares the norm of the angular velocity of the pelvis before and after the optimization. We then compress this policy by collecting 600 new DASS samples from it and test it on the physical robot. This corresponds to Policy B in Fig. 4 and can be seen in the supplementary video. The physical robot is able to step in place with a markedly smoother motion.
We extend this iterative-improvement approach to a tracking-based policy that is capable of walking at different speeds, from stepping in place up to a maximum forward walking speed. This policy suffers from the same problem as the tracking-based stepping-in-place policy, where the pelvis shakes at a moderately high frequency. For each of 10 speeds, taken at fixed increments across the commanded range, we similarly collect 3000 DASS samples and 3000 policy-gradient samples. The final policy produces stable walking motion that can be commanded at various speeds, both in simulation and on the physical robot.
We then distill this policy, together with two other tracking-based policies that walk backward and walk forward at higher speed, using the stable-pelvis reward. As before, we collect samples for these policies at fixed speed increments, and train a final unified policy that can walk at a range of forward and backward speeds with stable pelvis movement. We test this on the physical robot, and the robot is able to walk at speed on the treadmill as well as perform slow backward walks. This policy is then further compressed to a smaller hidden-layer network using supervised learning with DASS, with samples collected for each speed at the same increments; on the physical robot, the compressed policy achieves comparable speeds. We also compress this policy by collecting samples with speeds sampled at sparser increments. The resulting policy has similar capabilities on the physical robot, although it is less responsive to commanded changes in speed. The policies before and after compression correspond to Policy C and Policy D in Fig. 4 and in the video.
VII-B. Other Stylistic Rewards
We experiment with additional stylistic rewards. We observe that the previous policies still exhibit noisy movement, and we thus optimize solely for reduced joint accelerations. As before, transfer from the previous policy is achieved using DASS sampling, which is then coupled with policy-gradient RL. Fig. 8 compares the motion before and after the optimization. We then compress this policy to one with a small hidden layer using DASS samples. We also experiment with a reward for lifting the feet of the robot higher: the previous policies lift the feet to a modest height during each step, and we penalize the policy for lifting the foot less than a higher target. Guided by DASS samples from the previous policy, the new policy learns to lift the feet higher while maintaining good walking motions.
We test these policies on the physical robot. The motions on the robot are comparable to the motions in simulation. The policy that rewards low joint accelerations makes significantly softer and quieter ground contact. The policy optimized for lifting its feet higher achieves higher stepping. We further test the robustness of the high-stepping policy by placing boards in front of the robot. The policy is able to cope with this unmodeled disturbance and recover after several steps, as shown in Fig. 9. These policies correspond to Policy E and Policy F in Fig. 4 and are also shown in the video.
VIII. Conclusion and Discussion
In this paper, we present a data collection technique (DASS) that enables us to quickly recover, compress and distill multiple policies using supervised learning. Importantly, we demonstrate that DASSbased transfer learning can be integrated with policygradient RL methods. This directly supports an iterative design process, where each iteration of the design can optimize exclusively for a reward function that targets a desired change to the policy or motion style. We validate this approach on the bipedal robot Cassie, achieving stable walking motions with different styles at various speeds.
The final policies obtained are robust to unmodeled noise, which enables us to transfer them from simulation to the physical robot without difficulty. This differs from most sim-to-real results, where a large range of dynamics randomization is often needed to ensure successful transfer, even after careful system identification [35], and even for quadrupedal systems that may have more passive stability than a human-scale biped.
We show that the policies can be robust without resorting to dynamics randomization. This is shown in simulation, where the mass of the pelvis is perturbed, and by the successful transfer to the physical robot, which exhibits changing dynamics during its operation cycle. We hypothesize that the robustness stems from learning stochastic policies that operate at a low control rate, allowing the final policies to adapt to other noise. It will be interesting to further identify the most important considerations that ensure sim-to-real success, instead of always requiring dynamics randomization, which can cause the final policy to be overly conservative. We do note that the physical robot experiments exhibit asymmetric step lengths to a degree not seen in simulation. The source of this remains to be determined, but the policies are robust despite this difference.
We show that while it is beneficial to use a relatively large neural network during the reinforcement learning phase, the final policies can usually be represented by much smaller networks. It will be interesting to learn abstractions relevant to locomotion and to reuse such abstractions for more efficient learning and planning.
An advantage of using deep neural networks is that they can be readily extended to develop policies that directly accept rich perceptual input such as images, unlike traditional control methods. We wish to give Cassie such visual input and use this in support of visual navigation.
Our policies currently still take a reference motion as an input. However, once the initial tracking policy has been trained, the policy is free to develop its own movement styles through the subsequent iterative optimizations, and will even learn to stand still if given an appropriate reward function. The reference motion does provide the policies a means to condition themselves on time or motion phase, similar to [25]. In one sense, this gives the policy more flexibility, since with enough motions and offline training it may be capable of generalizing to unseen reference motions during testing. In another sense, the reference motion poses constraints on the final motions, e.g., it may be more difficult for the motion to make timing adjustments. Given that pure state-based feedback can also yield plausible locomotion, e.g., [43], it will be interesting to seek a balance between these two representations.
Acknowledgments
References
 Apgar et al. [2018] Taylor Apgar, Patrick Clary, Kevin Green, Alan Fern, and Jonathan Hurst. Fast online trajectory optimization for the bipedal robot Cassie. In Robotics: Science and Systems, 2018.
 Berseth et al. [2018] Glen Berseth, Cheng Xie, Paul Cernek, and Michiel van de Panne. Progressive reinforcement learning with distillation for multi-skilled motion control. In Proc. International Conference on Learning Representations, 2018.
 Chen et al. [2018] Si-An Chen, Voot Tangkaratt, Hsuan-Tien Lin, and Masashi Sugiyama. Active deep Q-learning with demonstration. arXiv e-prints, art. arXiv:1812.02632, December 2018.
 Chen et al. [2018] Xi Chen, Ali Ghadirzadeh, John Folkesson, and Patric Jensfelt. Deep reinforcement learning to acquire navigation skills for wheel-legged robots in complex environments. CoRR, abs/1804.10500, 2018. URL http://arxiv.org/abs/1804.10500.
 Clavera et al. [2018] Ignasi Clavera, Anusha Nagabandi, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt: Meta-learning for model-based control. CoRR, abs/1803.11347, 2018. URL http://arxiv.org/abs/1803.11347.
 Da et al. [2017] X. Da, R. Hartley, and J. W. Grizzle. Supervised learning for stabilizing underactuated bipedal robot locomotion, with outdoor experiments on the wave field. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3476–3483, May 2017. doi: 10.1109/ICRA.2017.7989397.
 Gong et al. [2018] Yukai Gong, Ross Hartley, Xingye Da, Ayonga Hereid, Omar Harib, Jiunn-Kai Huang, and Jessy W. Grizzle. Feedback control of a Cassie bipedal robot: Walking, standing, and riding a Segway. CoRR, abs/1809.07279, 2018. URL http://arxiv.org/abs/1809.07279.
 Heess et al. [2017] Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin A. Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. CoRR, abs/1707.02286, 2017. URL http://arxiv.org/abs/1707.02286.
 Henderson et al. [2018] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In AAAI, 2018.
 Hereid et al. [2018a] A. Hereid, C. M. Hubicki, E. A. Cousineau, and A. D. Ames. Dynamic humanoid locomotion: A scalable formulation for HZD gait optimization. IEEE Transactions on Robotics, 34(2):370–387, April 2018a. ISSN 1552-3098. doi: 10.1109/TRO.2017.2783371.
 Hereid et al. [2018b] Ayonga Hereid, Omar Harib, Ross Hartley, Yukai Gong, and Jessy W. Grizzle. Rapid bipedal gait design using C-FROST with illustration on a Cassie-series robot. CoRR, abs/1807.06614, 2018b. URL http://arxiv.org/abs/1807.06614.
 Hirai et al. [1998] K. Hirai, M. Hirose, Y. Haikawa, and T. Takenaka. The development of Honda humanoid robot. In Proceedings. 1998 IEEE International Conference on Robotics and Automation (Cat. No.98CH36146), volume 2, pages 1321–1326 vol.2, May 1998. doi: 10.1109/ROBOT.1998.677288.
 Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. arXiv e-prints, art. arXiv:1606.03476, June 2016.
 Hwangbo et al. [2019] Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26), 2019. doi: 10.1126/scirobotics.aau5872. URL http://robotics.sciencemag.org/content/4/26/eaau5872.
 Kajita et al. [2001] S. Kajita, F. Kanehiro, K. Kaneko, K. Yokoi, and H. Hirukawa. The 3D linear inverted pendulum mode: a simple modeling for a biped walking pattern generation. In Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the Next Millennium (Cat. No.01CH37180), volume 1, pages 239–246 vol.1, Oct 2001. doi: 10.1109/IROS.2001.973365.
 Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Laskey et al. [2017] Michael Laskey, Jonathan Lee, Roy Fox, Anca D. Dragan, and Kenneth Y. Goldberg. Dart: Noise injection for robust imitation learning. In CoRL, 2017.
 Levine and Koltun [2013] Sergey Levine and Vladlen Koltun. Guided policy search. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1–9, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/levine13.html.
 Li et al. [2018] Tianyu Li, Akshara Rai, Hartmut Geyer, and Christopher G. Atkeson. Using deep reinforcement learning to learn high-level policies on the ATRIAS biped. CoRR, abs/1809.10811, 2018. URL http://arxiv.org/abs/1809.10811.
 Liu et al. [2013] Chenggang Liu, Christopher G. Atkeson, and Jianbo Su. Biped walking control using a trajectory library. Robotica, 31(2):311–322, 2013. doi: 10.1017/S0263574712000203.
 Merel et al. [2018] Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. CoRR, abs/1811.11711, 2018. URL http://arxiv.org/abs/1811.11711.
 Nakanishi et al. [2004] Jun Nakanishi, Jun Morimoto, Gen Endo, Gordon Cheng, Stefan Schaal, and Mitsuo Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2):79–91, 2004. ISSN 0921-8890. doi: https://doi.org/10.1016/j.robot.2004.03.003. URL http://www.sciencedirect.com/science/article/pii/S0921889004000399.
 Parisotto et al. [2015] Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Actor-Mimic: Deep multi-task and transfer reinforcement learning. CoRR, abs/1511.06342, 2015. URL http://arxiv.org/abs/1511.06342.
 Peng et al. [2017] Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2017), 36(4), 2017.
 Peng et al. [2018] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (Proc. SIGGRAPH 2018), 37(4), 2018.
 Pomerleau [1988] Dean Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In NIPS, 1988.
 Posa et al. [2016] Michael Posa, Scott Kuindersma, and Russ Tedrake. Optimization and stabilization of trajectories for constrained dynamical systems. In Proceedings of the International Conference on Robotics and Automation, 2016.
 Ross and Bagnell [2010] Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/ross10a.html.
 Ross et al. [2011] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
 Rusu et al. [2015] Andrei A. Rusu, Sergio Gomez Colmenarejo, Çaglar Gülçehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. CoRR, abs/1511.06295, 2015. URL http://arxiv.org/abs/1511.06295.
 Schaal et al. [2003] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert. Control, planning, learning, and imitation with dynamic movement primitives. In IROS 2003, pages 1–21. Max-Planck-Gesellschaft, October 2003.
 Schuitema et al. [2010] E. Schuitema, M. Wisse, T. Ramakers, and P. Jonker. The design of leo: A 2d bipedal walking robot for online autonomous reinforcement learning. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3238–3243, Oct 2010. doi: 10.1109/IROS.2010.5650765.
 Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
 Sutton et al. [1999] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
 Tan et al. [2018] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. CoRR, abs/1804.10332, 2018.
 Tang and Hauser [2017] G. Tang and K. Hauser. A data-driven indirect method for nonlinear optimal control. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4854–4861, Sept 2017. doi: 10.1109/IROS.2017.8206362.
 Tang et al. [2018] Gao Tang, Weidong Sun, and Kris Hauser. Learning trajectories for real-time optimal control of quadrotors. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
 Tedrake et al. [2004] Russ Tedrake, Teresa Weirui Zhang, and H. Sebastian Seung. Stochastic policy gradient reinforcement learning on a simple 3D biped. 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), 3:2849–2854 vol.3, 2004.
 Tedrake et al. [2015] Russ Tedrake, Scott Kuindersma, Robin Deits, and Kanako Miura. A closed-form solution for real-time ZMP gait generation and feedback stabilization. In IEEE-RAS International Conference on Humanoid Robots, Seoul, Korea, 2015.
 Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, Oct 2012. doi: 10.1109/IROS.2012.6386109.
 Xie et al. [2018] Zhaoming Xie, Glen Berseth, Patrick Clary, Jonathan Hurst, and Michiel van de Panne. Feedback control for cassie with deep reinforcement learning. In Proc. IEEE/RSJ Intl Conf on Intelligent Robots and Systems (IROS 2018), 2018.
 Xiong and Ames [2018] Xiaobin Xiong and Aaron D. Ames. Coupling reduced order models via feedback control for 3D underactuated bipedal robotic walking. IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), 2018. URL http://ames.caltech.edu/xiong2018coupling.pdf.
 Yu et al. [2018] Wenhao Yu, Greg Turk, and C. Karen Liu. Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (Proc. SIGGRAPH 2018), 37(4), 2018.