Iterative Reinforcement Learning Based Design of Dynamic Locomotion Skills for Cassie


Zhaoming Xie¹, Patrick Clary², Jeremy Dao², Pedro Morais², Jonathan Hurst² and Michiel van de Panne¹
¹Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
Email: {zxie47, van}
²Dynamic Robotics Laboratory, Oregon State University, Corvallis, Oregon, USA
Email: {claryp,daoje,autranep,jonathan.hurst}

Deep reinforcement learning (DRL) is a promising approach for developing legged locomotion skills. However, the iterative design process that is inevitable in practice is poorly supported by the default methodology. It is difficult to predict the outcomes of changes made to the reward functions, policy architectures, and the set of tasks being trained on. In this paper, we propose a practical method that allows the reward function to be fully redefined on each successive design iteration while limiting the deviation from the previous iteration. We characterize policies via sets of Deterministic Action Stochastic State (DASS) tuples, which represent the deterministic policy state-action pairs as sampled from the states visited by the trained stochastic policy. New policies are trained using a policy gradient algorithm which then mixes RL-based policy gradients with gradient updates defined by the DASS tuples. The tuples also allow for robust policy distillation to new network architectures. We demonstrate the effectiveness of this iterative-design approach on the bipedal robot Cassie, achieving stable walking with different gait styles at various speeds. We demonstrate the successful transfer of policies learned in simulation to the physical robot without any dynamics randomization, and that variable-speed walking policies for the physical robot can be represented by a small dataset of 5-10k tuples.

I Introduction

Recent success in deep reinforcement learning (DRL) has inspired much work towards constructing locomotion policies for legged robots. Impressive results have been demonstrated on planar bipeds [19], quadruped robots [35, 14], and 6-legged robots [5]. However, these systems are relatively stable in comparison to human-scale bipeds, for which convincing demonstrations of DRL methods applied to dynamic locomotion on real hardware are still lacking, to the best of our knowledge. Nevertheless, a variety of results in simulation point to the promise of DRL methods in this area, e.g., [43, 24, 41, 8]. In practice, a multitude of issues can preclude the successful transfer of policies from simulation to the physical robot, including state estimation, modeling discrepancies, and motions that can cause excessive wear on hardware.

In this paper, we propose a DRL design process that reflects and supports the iterative nature of control policy design. At its heart is a data collection technique that allows us to recover a trained policy from a relatively small number of samples. With this technique, we can quickly compress and combine locomotion policies with supervised learning. By using policy gradient updates that combine the supervised learning samples and conventional DRL policy-gradient samples, we allow for the iterative design of improved policies using new reward functions that encourage desired behaviors. We validate our approach in simulation and on a physical Cassie robot, demonstrating stable walking policies with different styles at various speeds. Frames from a learned forward-walking gait for Cassie are shown in Fig. 1.

Fig. 1: Cassie walking on a treadmill with a neural network policy.

To summarize, this paper makes the following contributions:

  • We present a simple-yet-effective technique to reconstruct policies from only a small number of samples, and show that robust variable-speed walking policies can be achieved on physical hardware using datasets of 5-10k tuples taken from simulations.

  • We combine reinforcement learning with supervised learning from this small number of samples. Guided by these samples, new policies can be learned by designing new reward functions that define desired changes to the behaviors while staying close to the original policy. This offers a strong alternative to “fine-tuning” approaches, in which an existing policy is adapted via small changes and additions to an existing reward function, but which result in ever-more cumbersome reward functions and may exhibit unexpected changes in behavior.

  • We apply this approach to train various locomotion policies on a simulated model of the bipedal robot Cassie. These policies are successfully transferred to the physical robot without any further tuning or adaptation.

To the best of our knowledge, this is the first time that neural network policies for variable-speed locomotion have been successfully deployed on a human-scale bipedal robot such as Cassie. The policies trained in simulation are directly transferred to the physical robot without the use of dynamics randomization. The gaits are comparable to or faster in speed than other gaits reported in the literature for Cassie, e.g., [7].

II Related Work

II-A Supervised Learning for Trajectory Optimization

There exists significant prior art on generating optimal trajectories using supervised learning. This can be made fast and robust by precomputing a library of solutions and using a “warm start” for new problems using nearest neighbors, e.g., Liu et al. [20], Tang and Hauser [36]. Regression using neural networks has been used to generate optimal trajectories for bipedal robots [6] and quadrotors [37]. Guided Policy Search [18] makes use of solutions from trajectory optimization to guide the policy search.

II-B Imitation Learning

Imitation learning seeks to approximate an expert policy. In its simplest form, one can collect a sufficiently large sample of state-action pairs from the expert and apply supervised learning; this is referred to as behavior cloning, as used in the early seminal autonomous driving work of Pomerleau [26]. However, due to compounding errors and covariate shift, this method often leads to failure [28]. DAgger [29] was proposed to solve this problem by iteratively querying the expert policy to augment the expert dataset. Laskey et al. [17] inject adaptive noise into the expert policy to reduce expert queries. Recent work learns a linear approximation of the expert policy [21]. Another line of work formulates imitation learning as a reinforcement learning problem by inferring the reward signal from expert demonstrations, using methods such as GAIL [13]. Expert trajectories can also be stored in a reinforcement learning agent's experience buffer to accelerate the reinforcement learning process [3, 4]. Dynamic Movement Primitives [31, 22] provide another approach to incorporating expert demonstrations for learning motor skills.

II-C Distillation

Supervised learning is often used to combine multiple policies. For example, it has been used successfully to train policies that play multiple Atari games [30, 23]. More recently, Berseth et al. [2] use it to train a simulated 2D humanoid to traverse different types of terrain. These methods still suffer from the covariate shift problem and rely on DAgger during training.

II-D Bipedal Locomotion

Bipedal locomotion skills are important for robots to be able to traverse terrains that are typical in human environments. Many methods use the Zero Moment Point (ZMP) to plan stable walking motions, e.g., [12, 39]. Low dimensional models such as linear inverted pendulum (LIP) and spring loaded inverted pendulum (SLIP) can be used to simplify the robot dynamics [15, 1, 42] for easier and faster planning. To utilize the full dynamics of the robots, offline trajectory optimization such as direct collocation [11] is often used to generate trajectories, and tracking controllers based on quadratic programming [27] or feedback linearization [7] can be designed along these trajectories.

Reinforcement learning has also been applied to bipedal locomotion; results on hardware have been demonstrated on either 2D bipeds [32, 19] or bipeds with large feet [38]. More recently, deep reinforcement learning has been applied to 3D bipedal locomotion problems [24, 43, 8]. However, these works have not yet shown results on a physical robot.

III Preliminaries

In this section we briefly outline the reinforcement learning and imitation learning frameworks.

III-A Reinforcement Learning

In reinforcement learning, we wish to learn an optimal policy for a Markov Decision Process (MDP). The MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and action space of the problem, and the transition function $P$ is a probability density function, with $P(s' \mid s, a)$ being the probability density of $s'$ given that at state $s$ the system takes the action $a$. The reward function $r(s, a)$ gives a scalar reward for each transition of the system, and $\gamma \in (0, 1]$ is the discount factor. The goal of reinforcement learning is to find a policy $\pi_\theta$, parameterized by $\theta$, where $\pi_\theta(a \mid s)$ is the probability density of $a$ given $s$, that solves the following optimization problem:

$$\max_\theta \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t)\right],$$

subject to

$$a_t \sim \pi_\theta(\cdot \mid s_t), \qquad s_{t+1} \sim P(\cdot \mid s_t, a_t).$$

Policy gradient algorithms [34] are a popular approach to solving this problem, where the gradient of the expected return with respect to $\theta$ is estimated using on-policy samples, i.e., using data collected from the current stochastic policy.
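To make the on-policy estimate concrete, the following toy sketch (our illustration, not the paper's implementation) computes a REINFORCE-style policy-gradient estimate for a one-dimensional Gaussian policy with mean $\theta s$ and unit variance, for which the score function is $(a - \theta s)\,s$:

```python
def reinforce_gradient(theta, episodes, gamma=0.99):
    """Monte-Carlo policy-gradient estimate for a 1-D Gaussian policy with
    mean theta * s and unit variance, where grad_theta log pi(a|s) is
    (a - theta * s) * s.  Each episode is a (states, actions, rewards) tuple."""
    estimates = []
    for states, actions, rewards in episodes:
        # Discounted return of the episode.
        ret = sum(gamma ** t * r for t, r in enumerate(rewards))
        # Sum of score-function terms along the trajectory.
        score = sum((a - theta * s) * s for s, a in zip(states, actions))
        estimates.append(ret * score)
    return sum(estimates) / len(estimates)
```

In practice the paper uses a modern policy-gradient method rather than this raw estimator, but the structure, return-weighted score functions averaged over on-policy rollouts, is the same.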

III-B Imitation Learning

In imitation learning, we have an MDP as defined above, and an expert policy $\pi_E$ is given. The goal of imitation learning is to find a parameterized policy $\pi_\theta$ that minimizes the difference between $\pi_\theta$ and $\pi_E$. More formally, we aim to solve the following optimization problem:

$$\min_\theta \; \mathbb{E}_{s \sim d_{\pi_E}}\left[ D\big(\pi_E(\cdot \mid s),\, \pi_\theta(\cdot \mid s)\big) \right], \qquad (1)$$

where $D$ is a measure of the difference between two action distributions and $d_{\pi_E}$ is the probability density of states $s$ visited when executing the policy $\pi_E$.

The expectation in the objective is often estimated by collecting a dataset of expert demonstrations. In behavior cloning, the expert policy is assumed to be deterministic. This causes the well-known covariate shift problem, where the student policy accumulates errors over time and eventually drifts to states that were not seen by the expert during data collection. Popular remedies to this issue include DAgger [29] and DART [17], which query the expert policy iteratively to augment the dataset.

IV Methods

In this section, we present our method for collecting state-action pairs as a dataset for imitation learning, and show how this dataset can be used to combine imitation learning and reinforcement learning. In our iterative-design framework, we consider a previously-learned policy to be the expert for the next iteration of policy optimization.

IV-A Data Collection

If we assume $\pi_\theta$ and $\pi_E$ are Gaussian distributions with the same covariance, minimizing the imitation objective function (1) is equivalent to minimizing $\mathbb{E}_{s \sim d_{\pi_E}}\left[\|\mu_\theta(s) - \mu_E(s)\|^2\right]$, where $\mu_\theta, \mu_E$ are the means of $\pi_\theta$ and $\pi_E$. It is generally impractical to calculate this expectation exactly; in practice, we will collect an expert dataset $\mathcal{D} = \{(s_i, \mu_E(s_i))\}_{i=1}^{N}$ of size $N$, where $s_i$ is a state visited by the expert during policy execution, and minimize the training error over $\mathcal{D}$, i.e., we will solve the following supervised learning problem:

$$\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} \|\mu_\theta(s_i) - \mu_E(s_i)\|^2. \qquad (2)$$
Note that during data collection, while we record only the mean of the policy, we simulate a stochastic policy by adding noise to the mean during execution. This is related to [17], where adaptive noise is added to the expert policy to prevent covariate shift. In our setting, since we already know the distribution of our expert policy, we can avoid iteratively querying the expert with adaptive noise, and instead query the expert only once at the beginning. We refer to this data collection method as Deterministic Action Stochastic State (DASS), since we collect only deterministic actions, but at states that are sampled from the stochastic policy. Algorithm 1 summarizes our data collection procedure.

Algorithm 1 DASS
1: Input: stochastic expert policy $\pi_E$ with mean $\mu_E$, dataset size $N$
2: Reset $s_1$ from some initial state distribution
3: for $i = 1, \ldots, N$ do
4:     Record the tuple $(s_i, \mu_E(s_i))$
5:     Sample $a_i \sim \pi_E(\cdot \mid s_i)$, step the system: $s_{i+1} \sim P(\cdot \mid s_i, a_i)$
6:     if $s_{i+1} \in \mathcal{T}$ for some termination set $\mathcal{T}$ then reset $s_{i+1}$ from the initial state distribution
7: end for
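As an illustration, the DASS procedure can be sketched in a few lines of Python; the toy one-dimensional system, the expert, and all names here are hypothetical stand-ins for the simulated robot:

```python
import random

def dass_collect(mean_policy, noise_std, step, terminated, reset, n_samples):
    """Collect DASS tuples: record the deterministic mean action at states
    visited while *executing* the stochastic policy (mean + noise)."""
    dataset = []
    s = reset()
    for _ in range(n_samples):
        mu = mean_policy(s)                    # deterministic action: recorded
        dataset.append((s, mu))
        a = mu + random.gauss(0.0, noise_std)  # stochastic action: executed
        s = step(s, a)
        if terminated(s):
            s = reset()
    return dataset

# Toy 1-D system: the "expert" regulates the state toward zero.
expert = lambda s: -0.5 * s
data = dass_collect(expert, noise_std=0.1,
                    step=lambda s, a: s + a,
                    terminated=lambda s: abs(s) > 10.0,
                    reset=lambda: 1.0,
                    n_samples=500)
```

The stored actions are always the expert's means, but the visited states spread around the noiseless trajectory because the executed actions carry exploration noise.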

From a control perspective, this method for collecting expert data can be interpreted as follows. Consider policies, such as walking, that produce a limit cycle trajectory. If we record the actions of an expert with no noise, i.e., just the deterministic mean actions, then the collected data covers only the limit cycle, and the student never observes the feedback that should be applied when the state is off the limit cycle. With the noise of the stochastic policy, the expert further provides data on how to return to the limit cycle from states not on the cycle, and the student is able to learn a feedback controller along this limit cycle. This is illustrated schematically in Fig. 2, where the blue curve represents the limit cycle produced by a deterministic policy, and the green arrows represent the deterministic feedback actions associated with the additional states resulting from the execution of the stochastic policy.

Fig. 2: A walking policy produces a limit cycle, represented by the blue closed curve, and the green arrows indicate the required feedback to return to the limit cycle.

A key advantage of representing policies using DASS tuples is that they can be combined in order to distill multiple specialized policies into one single policy. If we assume the desired skill specification is implicit in the state information, we can collect datasets $\mathcal{D}_1, \ldots, \mathcal{D}_K$ corresponding to multiple experts $\pi_{E_1}, \ldots, \pi_{E_K}$, and use the union of these datasets for supervised learning.

IV-B Combining Reinforcement Learning and Imitation Learning

Imitation learning with DASS provides a sample-efficient method to recover and combine expert policies. However, a realistic design process necessitates further iteration, where we wish to train policies that are further refined with respect to some criteria, while also remaining close to the original expert policies. To achieve this goal, we add a constraint to the original formulation of the reinforcement learning problem:

$$\max_\theta \; J_{RL}(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t)\right],$$

subject to

$$L_{\mathcal{D}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \|\mu_\theta(s_i) - \mu_E(s_i)\|^2 \le \epsilon,$$

where $\mathcal{D}$ is a DASS dataset collected from the expert and $\epsilon$ is a tolerance. To make this problem easier, we turn the constraint on $L_{\mathcal{D}}$ into a soft constraint and rewrite the objective as $J_{RL}(\theta) - \alpha L_{\mathcal{D}}(\theta)$. At each iteration, we estimate $\nabla_\theta J_{RL}$ using the usual policy gradient algorithm, and update $\theta$ according to $\theta \leftarrow \theta + \eta\left(\nabla_\theta J_{RL}(\theta) - \alpha \nabla_\theta L_{\mathcal{D}}(\theta)\right)$.
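The soft-constraint update described in the text can be sketched for a toy one-parameter policy with mean $\mu_\theta(s) = \theta s$; the combination rule is the point of the sketch, while the specific functions and constants are illustrative:

```python
def dass_loss_grad(theta, states, actions):
    """Gradient of the supervised DASS loss L(theta) = mean((theta*s - a)^2)
    for a toy linear mean policy mu_theta(s) = theta * s."""
    n = len(states)
    return sum(2.0 * (theta * s - a) * s for s, a in zip(states, actions)) / n

def combined_step(theta, policy_gradient, states, actions, alpha=1.0, lr=0.01):
    """One update mixing the estimated RL policy gradient with the DASS
    gradient: theta <- theta + lr * (policy_gradient - alpha * dL/dtheta)."""
    return theta + lr * (policy_gradient
                         - alpha * dass_loss_grad(theta, states, actions))

# With zero reward the policy gradient vanishes and the update reduces to
# pure imitation: theta converges to the expert's coefficient (0.5 here).
states = [1.0, 2.0, -1.0]
actions = [0.5 * s for s in states]
theta = 0.0
for _ in range(1000):
    theta = combined_step(theta, policy_gradient=0.0,
                          states=states, actions=actions)
```

With a nonzero reward term, the same update trades off the new objective against staying close to the expert's actions on the expert's states.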

Note that the reward function need not be related to the expert skill. If we set the reward to zero, we recover the imitation learning problem. Furthermore, we can learn new skills without forgetting the expert skills. For example, the expert can be a policy for walking forward while the reward encourages the robot to walk backward. If one has reason to believe that the expert policy is suboptimal, we can also use this method to fine-tune the expert policy by defining the reward to be the same as the one used to train the expert. The benefit is that we do not need access to the expert policy itself for this fine-tuning to happen. Finally, we can design rewards so that the new policy satisfies additional objectives that we desire, such as smoother movement or lifting the feet higher at each step.

V Experimental Setup

V-A Robot Specification

We evaluate our methods on the Cassie bipedal robot. Cassie, shown in Fig. 3, is designed and built by Agility Robotics. It stands approximately 1 meter tall and has a total mass of 31 kg, with most of the weight concentrated in the pelvis. Two leaf springs on each leg make the legs more compliant. This introduces extra underactuation into the system and makes control design more difficult for traditional techniques.

Fig. 3: Left: The bipedal robot Cassie used for evaluation. The red arrows indicate the axes of actuated joints, the yellow arrows indicate passive joints with stiff leaf springs attached for compliance. Right: The neural network used to parameterize the policy.
Fig. 4: Our policy design process. Four tracking-based policies are used as a starting point. DASS samples are passed from one policy to the next according to the arrows.

V-B Training Framework

We adopt the framework used in [41] for training several initial policies, where we reward the agent for producing motion that approximately reproduces a set of specified reference motions. The input to the policy consists of the state of the robot, which evolves according to the robot's dynamics, and the reference-motion state, which evolves deterministically according to the motion we wish to track. The state of the robot includes the height, orientation expressed as a unit quaternion, velocity, angular velocity and acceleration of the pelvis, along with the angles and velocities of each joint; together these form the policy's input vector. We use the commonly-adopted Gaussian policy as output, where the neural network outputs the mean of the policy and Gaussian noise is injected on top of the action during execution. The output and the reference motion are summed to produce target joint angles for a low-level PD controller. Instead of making the covariance of the Gaussian policy a learnable parameter, we use a fixed covariance for our policy, assuming that the Gaussian distribution in each dimension is independent, with a fixed standard deviation in radians. A benefit of the fixed covariance is that, because noise is constantly injected into the system during training, the resulting policy adapts itself to handle unmodeled disturbances during testing, as demonstrated in previous work [41, 25]. The network architecture is shown in Fig. 3. The policy is trained with an actor-critic algorithm using a simulated model of Cassie in the MuJoCo simulator [40], with the gradient of the policy estimated using Proximal Policy Optimization [33]. The simulator includes a detailed model of the robot's rigid-body dynamics, including the reflected inertia of the robot's motors, as well as empirically measured noise and delay for the robot's sensors and actuators.
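The action pathway described above (mean network output, fixed Gaussian noise, reference-summed PD targets) can be sketched as follows; the gains and the noise scale are illustrative placeholders, not the values used on Cassie:

```python
import random

def sample_action(mean, std=0.1):
    """Fixed-covariance Gaussian policy: constant-scale noise (illustrative
    value) is added to the network's mean output during execution."""
    return [m + random.gauss(0.0, std) for m in mean]

def pd_targets(action, reference_angles):
    """The policy output and the reference motion are summed to produce
    target joint angles for the low-level PD controller."""
    return [a + q_ref for a, q_ref in zip(action, reference_angles)]

def pd_torque(q, dq, target, kp=100.0, kd=5.0):
    """Joint-level PD law tracking the target angles (gains illustrative)."""
    return [kp * (t - qi) - kd * dqi for t, qi, dqi in zip(target, q, dq)]
```

Because the policy outputs an offset to the reference motion rather than absolute joint targets, a zero network output already reproduces the reference trajectory.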

V-C Policy Training

The design process we use for training our policies is summarized in Fig. 4. We initially train four different tracking-based policies: stepping in place; walking forward over a range of speeds; walking backward; and fast walking forward. The reference motions for stepping in place and walking forward are recorded from motion produced by an existing heuristically tuned controller, and the reference motions for walking at other speeds are obtained by scaling the translation and velocity of the forward-walking motion by the desired value. There exist numerous other choices for obtaining reference motions, including direct collocation [11] or keyframing, but these are beyond the scope of this paper. The reference motions we work with are symmetric and the robot itself is nearly symmetric, so it is natural to enforce symmetry in the policies as well. We adopt an approach similar to [10], transforming the input and output every half walking cycle to their symmetric form. During training, we apply the reference state initialization and early termination techniques suggested by Peng et al. [25], where each rollout is started from a state sampled from the reference motions and is terminated when the height of the pelvis drops below a fixed threshold, or whenever the reward for any given timestep falls below a threshold.
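The half-cycle symmetry transform can be sketched with hypothetical permutation and sign arrays; Cassie's actual state ordering is not specified here, so the layout below is purely illustrative:

```python
def mirror(vec, perm, signs):
    """Left-right mirror of a state or action vector: permute paired
    left/right entries and flip the sign of lateral quantities.  The
    perm and signs arrays below are hypothetical placeholders, not
    Cassie's actual state layout."""
    return [signs[i] * vec[perm[i]] for i in range(len(vec))]

# Toy 4-D layout: entries 0-1 belong to the left leg, 2-3 to the right leg,
# and the second entry of each pair is a lateral (sign-flipped) quantity.
perm = [2, 3, 0, 1]
signs = [1, -1, 1, -1]

def symmetric_policy(policy, x, second_half_cycle):
    """Enforce a symmetric gait: during the second half of the walking
    cycle, evaluate the policy on the mirrored state and mirror the
    resulting action back."""
    if not second_half_cycle:
        return policy(x)
    return mirror(policy(mirror(x, perm, signs)), perm, signs)
```

Applying the mirror twice is the identity, so the transform is consistent: the same network weights serve both halves of the gait cycle.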

The initial tracking-based policies are then used as the starting point for further design exploration. We show several of these policies running on the physical robot in our supplementary video. Several intermediate policies are also successfully tested on the robot, but are not shown due to video-duration constraints. At each level, all policies are trained from scratch instead of fine-tuning the previous policies. This is important for distilling multiple policies together, and for policy compression onto a smaller network; in these cases, an original policy will not be available. We further note that fine-tuning a policy based on a new reward function often results in undesired changes to the policy, as it can readily “forget” the objectives and motion features of the starting policy.

V-D Hardware Tests

We deploy a selection of trained policies on a physical Cassie robot. The state of the robot is estimated using sensor measurements from an IMU and joint encoders, which are fed into an Extended Kalman Filter to estimate parts of the robot’s state, such as the pelvis velocity. This process runs at 2 kHz on an embedded computer directly connected to the robot’s hardware. This information is sent over a point-to-point Ethernet link to a secondary computer onboard the robot, which runs a standard Ubuntu operating system and executes the learned policy using the PyTorch framework. The policy updates its output joint PD targets once every 30 ms based on the latest state data and sends the targets back to the embedded computer over the Ethernet link. The embedded computer executes a PD control loop for each joint at the full 2 kHz rate, with targets updating every 30 ms based on new data from the policy execution.
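The two-rate structure (a 2 kHz PD inner loop with policy targets refreshed every 30 ms) can be sketched as follows; the callback names are illustrative:

```python
def run_control(policy, pd_step, n_ticks, pd_rate_hz=2000, policy_period_s=0.030):
    """Two-rate control loop: the joint PD controller runs on every tick
    (2 kHz), while the policy refreshes the PD targets once every 30 ms,
    i.e. every 60 ticks.  policy and pd_step are illustrative callbacks."""
    ticks_per_policy = round(pd_rate_hz * policy_period_s)  # 60 ticks
    targets = None
    policy_calls = 0
    for tick in range(n_ticks):
        if tick % ticks_per_policy == 0:
            targets = policy(tick)   # slow loop: compute new joint PD targets
            policy_calls += 1
        pd_step(targets)             # fast loop: PD control at 2 kHz
    return policy_calls
```

The fast loop always acts on the most recent targets, so the neural network only needs to run at roughly 33 Hz while the joints are servoed at the full rate.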

Rapid deployment and testing is aided by the simulator using the same network-based interface as the physical robot, which means that tests can be moved from simulation to hardware by copying files to the robot's onboard computer and connecting to a different address. The robot performs a short homing procedure after powering on, and can be left on between tests of different policies. The same filtering and estimation code used on hardware runs internally in the simulator, rather than giving the policy direct access to the true simulation state. The network link between the two computers introduces an additional 1-2 ms of latency compared to running the simulator and policy on the same machine. In addition, many of the robot's body masses differ slightly from the simulated robot, due to imprecisely modeled cabling and electronics and minor modifications made to the robot since the simulation parameters were produced.

VI Policy Compression and Distillation

In this section, we present results for using DASS to compress and distill multiple policies. In these experiments, we update the student policies using ADAM [16] with the supervised loss from Equation (2). We collect an additional set of DASS samples for evaluating validation error, and stop training when the training error stops improving over a window of iterations.
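A minimal sketch of this compression step, using a one-dimensional linear student and plain gradient descent in place of ADAM and a neural network, might look like:

```python
def fit_student(train, val, lr=0.05, tol=1e-8, patience=50):
    """Policy-compression sketch: fit a 1-D linear student mu(w, s) = w * s
    to (state, action) tuples by minimizing the mean squared error, and stop
    once the training loss stops improving.  (The paper trains a small
    neural network with ADAM; plain gradient descent is used here.)"""
    def loss(w, data):
        return sum((w * s - a) ** 2 for s, a in data) / len(data)

    w, best, stalled = 0.0, float("inf"), 0
    while stalled < patience:
        grad = sum(2.0 * (w * s - a) * s for s, a in train) / len(train)
        w -= lr * grad
        cur = loss(w, train)
        stalled = stalled + 1 if best - cur < tol else 0
        best = min(best, cur)
    return w, loss(w, val)

# "Expert" DASS-style tuples from a linear policy a = 0.7 * s.
train = [(s, 0.7 * s) for s in (-2.0, -1.0, 1.0, 2.0)]
val = [(s, 0.7 * s) for s in (-1.5, 0.5, 1.5)]
w, val_loss = fit_student(train, val)
```

The held-out tuples play the same role as the extra DASS samples mentioned above: they check that the student generalizes beyond the exact states it was fit on.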

VI-A Policy Compression

In deep reinforcement learning, network size often plays an important role in determining the end result [9]. It is further shown in [30] that, for learning to play Atari games, a large network is necessary during RL training to achieve good performance, but the final policy can be compressed using supervised learning without degrading performance. For our problem, we likewise observe that a larger network improves the efficiency of reinforcement learning and produces more robust policies, as shown in Fig. 5.

Fig. 5: Network size impacts the final result for reinforcement learning. We observe that larger networks typically learn faster and yield more stable policies. Compared to the largest network, learning proceeds much more slowly for the smaller network sizes and has larger variance, indicating that the final policy is not robust to noise.

While we need a large network to do reinforcement learning efficiently, we find that we can compress the expert policy into a much smaller network while maintaining the robustness of the final policy. With as few as 300 samples, we can recover a stepping-in-place policy with a small hidden layer. Fig. 6 compares a dataset collected using behavior cloning with one collected using the DASS strategy. Table I compares policies trained using supervised learning across varying choices of hidden layer size, number of training samples, and the presence or absence of noise during data collection. With so few samples, a large network can easily overfit the training data. We find that while larger networks can indeed have this issue, with validation error orders of magnitude larger than the training error, the resulting policy still performs comparably to the original policy in terms of robustness.

Fig. 6: Joint angles of the left knee in the expert (teacher) dataset, as collected via behavior cloning or DASS. Behavior cloning visits only a limited set of states, namely those very near the limit cycle.
policy | training loss | validation loss | no noise | 0.1 policy noise | 20% mass noise | 50 N pushes
300 samples | - | - | - | - | - | -
600 samples | - | - | - | - | - | -
no noise | - | - | - | - | - | -
TABLE I: Comparison of policies trained with various settings; all rows use the same default hidden layer size. We evaluate the robustness of each policy by injecting noise of magnitude 0.1 into the policy actions, increasing the mass of the pelvis by 20%, and applying 50 N pushes in the forward direction at regular intervals, and report the cumulative reward each policy obtains over a fixed number of control steps.

We successfully test the policy on the physical robot. It exhibits similar behavior as in simulation, with the robot stepping in place while supporting its own weight. However, the pelvis exhibits an undesirable shaky movement, both in simulation and on the physical robot, shown in the supplementary video. This corresponds to Policy A in Fig. 4.

VI-B Policy Distillation

After training a network for one skill, we may want the policy to learn additional skills. In the context of the Cassie robot, we desire a control policy that can not only step in place, but also walk forwards and backwards on command. However, catastrophic forgetting can occur when trying to learn new skills, and distillation is one way to deal with this. We can learn policies that master these skills separately and then distill them into a single policy with supervised learning. We distill the three expert policies trained separately for walking forward, stepping in place, and walking backward into one policy that masters all three tasks. With samples collected from each of these policies, we are able to combine them into a single policy with a small hidden layer.

VII Iterative Design with Changing Rewards

VII-A Stable Pelvis Movement

As noted in the previous section, the tracking-based policy results in an undesirable shaking of the robot body. While in simulation this does not affect the ability of the robot to complete its task, it places excessive wear on the physical robot hardware. We now apply the framework that combines policy-gradient RL updates and DASS-based policy cloning, in support of iterative policy design. The reward is dedicated to achieving stable movement of the robot body while maintaining the desired speed. Specifically, we set the reward to be the sum of a term that penalizes the angular velocity $\omega$ of the pelvis and a term that ensures the policy tracks the desired velocity.
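One plausible form of such a reward, with illustrative weights and an exponential shaping that we assume here (the paper's exact coefficients are not reproduced), is:

```python
import math

def stable_pelvis_reward(pelvis_angular_velocity, speed, desired_speed,
                         w_stable=0.5, w_speed=0.5):
    """One term rewards low pelvis angular velocity, one term rewards
    tracking the commanded speed; the weights and the exponential shaping
    are illustrative assumptions, not the paper's exact values."""
    omega_sq = sum(w * w for w in pelvis_angular_velocity)
    r_stable = math.exp(-omega_sq)
    r_speed = math.exp(-(speed - desired_speed) ** 2)
    return w_stable * r_stable + w_speed * r_speed
```

A perfectly steady pelvis at the commanded speed earns the maximum reward, and either shaking or speed error smoothly reduces it.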

We first experiment with the naive approach of simply fine-tuning the previous policy using the new reward, with a desired velocity of zero. This results in the robot learning to stand still; while this is a perfectly usable policy for this particular reward, it has effectively forgotten how to step in place. We next test learning that incorporates DASS samples into the policy update. To balance the number of DASS samples and on-policy samples, on each iteration we train on 3000 DASS samples via supervised learning, always drawn from the same fixed set, together with 3000 on-policy policy-gradient samples. The resulting policy produces the desired stepping-in-place motion with much smoother pelvis movement than the original tracking-based policy. Fig. 7 compares the norm of the angular velocity of the pelvis before and after the optimization. We then compress this policy by collecting 600 new DASS samples from it and test it on the physical robot. This corresponds to Policy B in Fig. 4 and can be seen in the supplementary video. The physical robot is able to step in place with a markedly smoother motion.

Fig. 7: Comparison of the norm of the angular velocity of the pelvis before and after optimization.

We extend this iterative-improvement approach to a tracking-based policy that is capable of walking at different speeds, from stepping in place to walking forward at its maximum commanded speed. This policy suffers from the same problem as the tracking-based stepping-in-place policy: the pelvis shakes at a moderately high frequency. For each speed, we similarly collect 3000 DASS samples and 3000 policy-gradient samples, for a total of 10 speeds taken at uniform increments. The final policy produces stable walking motion that can be commanded at various speeds, both in simulation and on the physical robot.

We distill this policy together with two other tracking-based policies, one walking backwards and one walking forwards at a higher speed, using the stable-pelvis reward. As before, we collect samples for these policies at uniform speed increments, and train a final unified policy that can walk with stable pelvis movement across the full range of speeds, from backwards to fast forwards. We test this on the physical robot, and the robot is able to reach its top commanded speed on the treadmill as well as walk slowly backwards. This policy is then further compressed to a network with a small hidden layer using supervised learning with DASS, with samples collected at each speed increment; on the physical robot, the compressed policy achieves a comparable top speed. We also compress the policy by collecting samples at sparser speed increments. The resulting policy has similar capabilities on the physical robot, although it is less responsive to commanded changes in speed. The policies before and after compression correspond to Policy C and Policy D in Fig. 4 and in the video.

VII-B Other Stylistic Rewards

We experiment with additional stylistic rewards. We observe that the previous policies still exhibit noisy movement, and we thus optimize (solely) for reduced joint accelerations. As before, transfer from the previous policy is achieved using DASS sampling, which is then coupled with policy-gradient RL. Fig. 8 compares the motion before and after the optimization. We then compress this policy to a network with a small hidden layer using DASS samples. We also experiment with a reward that encourages the robot to lift its feet higher. The previous policies lift the feet only a small height off the ground during each step; we penalize the policy for insufficient foot clearance. Guided by DASS samples from the previous policy, the new policy learns to lift its feet considerably higher while maintaining a good walking motion.

We test these policies on the physical robot. The motions on the robot are comparable to the motions in simulation. The policy that rewards low joint accelerations makes significantly softer and quieter ground contact. The policy optimized for lifting its feet higher achieves higher stepping. We further test the robustness of the high-stepping policy by placing boards in front of the robot. The policy is able to cope with this unmodeled disturbance and recover after several steps, as shown in Fig. 9. These policies correspond to Policy E and Policy F in Fig. 4 and are also shown in the video.

Fig. 8: Phase portrait for all the joints on the left leg during stepping in place. The blue curve is before optimizing for lower joint accelerations, and the green curve is after.
Fig. 9: Cassie recovers from stepping on an unexpected obstacle.

VIII Conclusion and Discussion

In this paper, we present a data collection technique (DASS) that enables us to quickly recover, compress and distill multiple policies using supervised learning. Importantly, we demonstrate that DASS-based transfer learning can be integrated with policy-gradient RL methods. This directly supports an iterative design process, where each iteration of the design can optimize exclusively for a reward function that targets a desired change to the policy or motion style. We validate this approach on the bipedal robot Cassie, achieving stable walking motions with different styles at various speeds.

The final policies obtained are robust to unmodeled noise and enable us to transfer them from simulation to the physical robot without difficulty. This differs from most sim-to-real results, where a large range of dynamics randomization is often needed to ensure successful transfer, despite careful system identification [35] and the use of quadrupedal systems that may have more passive stability than a human-scale biped.

We show that the policies can be robust without resorting to dynamics randomization. This is shown in simulation, where the mass of the pelvis is perturbed by , and by the successful transfer to the physical robot, which exhibits changing dynamics during its operation cycle. We hypothesize that the robustness stems from learning stochastic policies that operate at a low control rate, allowing the final policies to adapt to other sources of noise. It will be interesting to further identify the most important considerations for ensuring sim-to-real success without always requiring dynamics randomization, which can cause the final policy to be overly conservative. We do note that the physical robot experiments exhibit asymmetric step lengths to a degree that is not seen in simulation. The source of this remains to be determined, but the policies are robust despite this difference.

We show that while it is beneficial to use a relatively large neural network during the reinforcement learning phase, the final policies can usually be represented by much smaller networks. It will be interesting to learn abstractions relevant to locomotion and to be able to reuse such abstractions for more efficient learning and planning.
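Compressing a trained policy into a smaller model is a supervised regression problem on the DASS tuples. A minimal sketch, using a linear least-squares fit purely for illustration (the paper distills into small neural networks, not linear maps):

```python
import numpy as np

def distill(states, actions):
    """Hypothetical distillation: fit a compact linear policy to DASS
    (state, action) tuples by least squares. A small MLP trained with
    the same regression loss could be substituted when a linear map
    underfits."""
    S = np.asarray(states)   # (N, state_dim)
    A = np.asarray(actions)  # (N, action_dim)
    W, *_ = np.linalg.lstsq(S, A, rcond=None)
    return W                 # compact policy: action ≈ state @ W
```

Because the tuples record the deterministic actions at states visited by the stochastic policy, a few thousand of them can suffice to specify the behavior, consistent with the 5-10k tuples reported for variable-speed walking.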

An advantage of using deep neural networks is that they can be readily extended to develop policies that directly accept rich perceptual input such as images, unlike traditional control methods. We wish to give Cassie such visual input and use this in support of visual navigation.

Our policies currently still take a reference motion as an input. However, once the initial tracking policy has been trained, the policy is free to develop its own movement styles according to the subsequent iterative optimizations, and will even learn to stand still if given an appropriate reward function. The reference motion provides the policies with a means to condition themselves on time or motion phase, similar to [25]. In one sense, this gives the policy more flexibility, since with enough motions and offline training it may be capable of generalizing to unseen reference motions at test time. In another sense, the reference motion constrains the final motions, e.g., it may be more difficult for the policy to make timing adjustments. Given that pure state-based feedback can also yield plausible locomotion, e.g., [43], it will be interesting to seek a balance between these two representations.



  • Apgar et al. [2018] Taylor Apgar, Patrick Clary, Kevin Green, Alan Fern, and Jonathan Hurst. Fast online trajectory optimization for the bipedal robot cassie. In Robotics: Science and Systems, 2018.
  • Berseth et al. [2018] Glen Berseth, Cheng Xie, Paul Cernek, and Michiel van de Panne. Progressive reinforcement learning with distillation for multi-skilled motion control. In Proc. International Conference on Learning Representations, 2018.
  • Chen et al. [2018] Si-An Chen, Voot Tangkaratt, Hsuan-Tien Lin, and Masashi Sugiyama. Active Deep Q-learning with Demonstration. arXiv e-prints, art. arXiv:1812.02632, December 2018.
  • Chen et al. [2018] Xi Chen, Ali Ghadirzadeh, John Folkesson, and Patric Jensfelt. Deep reinforcement learning to acquire navigation skills for wheel-legged robots in complex environments. CoRR, abs/1804.10500, 2018.
  • Clavera et al. [2018] Ignasi Clavera, Anusha Nagabandi, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt: Meta-learning for model-based control. CoRR, abs/1803.11347, 2018.
  • Da et al. [2017] X. Da, R. Hartley, and J. W. Grizzle. Supervised learning for stabilizing underactuated bipedal robot locomotion, with outdoor experiments on the wave field. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3476–3483, May 2017. doi: 10.1109/ICRA.2017.7989397.
  • Gong et al. [2018] Yukai Gong, Ross Hartley, Xingye Da, Ayonga Hereid, Omar Harib, Jiunn-Kai Huang, and Jessy W. Grizzle. Feedback control of a cassie bipedal robot: Walking, standing, and riding a segway. CoRR, abs/1809.07279, 2018.
  • Heess et al. [2017] Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin A. Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. CoRR, abs/1707.02286, 2017.
  • Henderson et al. [2018] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In AAAI, 2018.
  • Hereid et al. [2018a] A. Hereid, C. M. Hubicki, E. A. Cousineau, and A. D. Ames. Dynamic humanoid locomotion: A scalable formulation for hzd gait optimization. IEEE Transactions on Robotics, 34(2):370–387, April 2018a. ISSN 1552-3098. doi: 10.1109/TRO.2017.2783371.
  • Hereid et al. [2018b] Ayonga Hereid, Omar Harib, Ross Hartley, Yukai Gong, and Jessy W. Grizzle. Rapid bipedal gait design using C-FROST with illustration on a cassie-series robot. CoRR, abs/1807.06614, 2018b.
  • Hirai et al. [1998] K. Hirai, M. Hirose, Y. Haikawa, and T. Takenaka. The development of honda humanoid robot. In Proceedings. 1998 IEEE International Conference on Robotics and Automation (Cat. No.98CH36146), volume 2, pages 1321–1326 vol.2, May 1998. doi: 10.1109/ROBOT.1998.677288.
  • Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning. arXiv e-prints, art. arXiv:1606.03476, June 2016.
  • Hwangbo et al. [2019] Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26), 2019. doi: 10.1126/scirobotics.aau5872.
  • Kajita et al. [2001] S. Kajita, F. Kanehiro, K. Kaneko, K. Yokoi, and H. Hirukawa. The 3d linear inverted pendulum mode: a simple modeling for a biped walking pattern generation. In Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No.01CH37180), volume 1, pages 239–246 vol.1, Oct 2001. doi: 10.1109/IROS.2001.973365.
  • Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • Laskey et al. [2017] Michael Laskey, Jonathan Lee, Roy Fox, Anca D. Dragan, and Kenneth Y. Goldberg. Dart: Noise injection for robust imitation learning. In CoRL, 2017.
  • Levine and Koltun [2013] Sergey Levine and Vladlen Koltun. Guided policy search. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1–9, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • Li et al. [2018] Tianyu Li, Akshara Rai, Hartmut Geyer, and Christopher G. Atkeson. Using deep reinforcement learning to learn high-level policies on the ATRIAS biped. CoRR, abs/1809.10811, 2018.
  • Liu et al. [2013] Chenggang Liu, Christopher G. Atkeson, and Jianbo Su. Biped walking control using a trajectory library. Robotica, 31(2):311–322, 2013. doi: 10.1017/S0263574712000203.
  • Merel et al. [2018] Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. CoRR, abs/1811.11711, 2018.
  • Nakanishi et al. [2004] Jun Nakanishi, Jun Morimoto, Gen Endo, Gordon Cheng, Stefan Schaal, and Mitsuo Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2):79–91, 2004. ISSN 0921-8890. Robot Learning from Demonstration special issue.
  • Parisotto et al. [2015] Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. CoRR, abs/1511.06342, 2015.
  • Peng et al. [2017] Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2017), 36(4), 2017.
  • Peng et al. [2018] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (Proc. SIGGRAPH 2018), 37(4), 2018.
  • Pomerleau [1988] Dean Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In NIPS, 1988.
  • Posa et al. [2016] Michael Posa, Scott Kuindersma, and Russ Tedrake. Optimization and stabilization of trajectories for constrained dynamical systems. In Proceedings of the International Conference on Robotics and Automation, 2016.
  • Ross and Bagnell [2010] Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
  • Ross et al. [2011] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
  • Rusu et al. [2015] Andrei A. Rusu, Sergio Gomez Colmenarejo, Çaglar Gülçehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. CoRR, abs/1511.06295, 2015.
  • Schaal et al. [2003] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert. Control, planning, learning, and imitation with dynamic movement primitives. In IROS 2003, pages 1–21. Max-Planck-Gesellschaft, October 2003.
  • Schuitema et al. [2010] E. Schuitema, M. Wisse, T. Ramakers, and P. Jonker. The design of leo: A 2d bipedal walking robot for online autonomous reinforcement learning. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3238–3243, Oct 2010. doi: 10.1109/IROS.2010.5650765.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • Sutton et al. [1999] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
  • Tan et al. [2018] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. CoRR, abs/1804.10332, 2018.
  • Tang and Hauser [2017] G. Tang and K. Hauser. A data-driven indirect method for nonlinear optimal control. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4854–4861, Sept 2017. doi: 10.1109/IROS.2017.8206362.
  • Tang et al. [2018] Gao Tang, Weidong Sun, and Kris Hauser. Learning trajectories for real-time optimal control of quadrotors. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
  • Tedrake et al. [2004] Russ Tedrake, Teresa Weirui Zhang, and H. Sebastian Seung. Stochastic policy gradient reinforcement learning on a simple 3d biped. 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), 3:2849–2854 vol.3, 2004.
  • Tedrake et al. [2015] Russ Tedrake, Scott Kuindersma, Robin Deits, and Kanako Miura. A closed-form solution for real-time zmp gait generation and feedback stabilization. In IEEE-RAS International Conference on Humanoid Robots, Seoul, Korea, 2015.
  • Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, Oct 2012. doi: 10.1109/IROS.2012.6386109.
  • Xie et al. [2018] Zhaoming Xie, Glen Berseth, Patrick Clary, Jonathan Hurst, and Michiel van de Panne. Feedback control for cassie with deep reinforcement learning. In Proc. IEEE/RSJ Intl Conf on Intelligent Robots and Systems (IROS 2018), 2018.
  • Xiong and Ames [2018] Xiaobin Xiong and Aaron D. Ames. Coupling reduced order models via feedback control for 3d underactuated bipedal robotic walking. IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), 2018.
  • Yu et al. [2018] Wenhao Yu, Greg Turk, and C. Karen Liu. Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (Proc. SIGGRAPH 2018 - to appear), 37(4), 2018.