Transfer from Simulation to Real World through
Learning Deep Inverse Dynamics Model
Abstract
Developing control policies in simulation is often more practical and safer than directly running experiments in the real world. This applies to policies obtained from planning and optimization, and even more so to policies obtained from reinforcement learning, which is often very data demanding. However, a policy that succeeds in simulation often doesn't work when deployed on a real robot. Nevertheless, often the overall gist of what the policy does in simulation remains valid in the real world. In this paper we investigate such settings, where the sequence of states traversed in simulation remains reasonable for the real world, even if the details of the controls are not, as could be the case when the key differences lie in detailed friction, contact, mass and geometry properties. During execution, at each time step our approach computes what the simulation-based control policy would do, but then, rather than executing these controls on the real robot, our approach computes what the simulation expects the resulting next state(s) will be, and then relies on a learned deep inverse dynamics model to decide which real-world action is most suitable to achieve those next states. Deep models are only as good as their training data, and we also propose an approach for data collection to (incrementally) learn the deep inverse dynamics model. Our experiments show that our approach compares favorably with various baselines that have been developed for dealing with simulation to real world model discrepancy, including output error control and Gaussian dynamics adaptation.
I Introduction
Many methods exist for generating control policies in simulated environments, including methods based on motion planning, optimization, control, and learning. However, an important practical challenge is that often there are discrepancies between simulation and the real world, which results in policies that work well in simulation yet perform poorly in the real world.
Significant bodies of work exist that strive to address this challenge. One important line of work studies how to improve simulators to better match reality, which involves improving simulation of contact, non-rigidity, and friction, as well as improving identification of physical quantities needed for accurate simulation such as mass, geometry, friction coefficients, and elasticity. However, despite significant progress, discrepancies continue to exist, and more accurate simulation can have the downside of being slower.
Another important line of work studies robustness of control policies, which could be measured through, for example, gain and phase margins, and robust control methods exist that can optimize for these. Optimizing for robustness means finding control policies that apply across a wide range of possible real worlds, but unfortunately tends to come at the expense of performance in the one specific real world the system is faced with.
Adaptive methods, which are the topic of this paper, do not use the same policy for the entire family of possible environments, but rather try to learn about the specific real world the system is faced with. In principle, such methods can exploit the physics of the real world and behave in the optimal way.
Concretely, our work considers the following problem setting: We assume to be given a simulator and a method for generating policies that perform well in simulation. The goal is to leverage this to perform well in new real-world situations. To achieve this, a training period exists during which an adaptation mechanism can be trained to learn to adapt from simulation to real world by collecting experience on the real system, but without having access to the new real-world situations that the system will be evaluated on later.
We leverage the following intuition: Often policies found from simulation capture the high-level gist well (e.g., overall trajectory), but fail to accurately capture some of the lower-level details, such as friction, stiction, backlash, hysteresis, precise measurements, precise deformation, etc. Indeed, this is the type of situation that motivates the work in this paper and in which we will be evaluating our approach (as well as baselines).
Note that while we assume that a method exists for generating policies in simulation, our approach is agnostic to the details of this method. It could be based on any techniques from motion planning, optimization, control, learning, and others that return a policy, including a model-predictive policy that uses the simulator in its inner loop.
Our approach proceeds as follows: During execution on a test trajectory, at each time step it computes what the simulation-based control policy would do, but then, rather than executing these controls on the real robot, our approach computes what the simulation expects the next state(s) will be, and then relies on a learned deep inverse dynamics model to decide which real-world action is most suitable to achieve those next states. As our experiments show, when these inverse dynamics models are trained on sufficient data, this results in compelling transfer from simulation to real world, in particular with challenging dynamics involving contact and collision. To collect the training data, there is a training phase which proceeds the same way, but only has access to a poor inverse dynamics model, and then uses the collected data to improve the model. Our experiments show that having the training data collection be similar to the test time conditions improves results significantly compared to data collection based on just applying random controls. To maximize data collection efficiency, target trajectories for training are initially short (or cut short once significantly deviating from the target).
Our experiments validate the applicability of our approach through two families of experiments: (i) Sim1 to Sim2 Transfer: To better understand the transfer capabilities, we first study transfer from one simulation (Sim1) to another simulation (Sim2). We consider several standard tasks: Reacher, Hopper, Cheetah, Humanoid from MuJoCo / OpenAI Gym [38] [3]. For each experiment Sim2 has the same type of robot as Sim1, but the physical properties are different (change in mass, link lengths, friction coefficients, torque scale and limits). (ii) Sim to Real Transfer with Fetch: In this family of experiments we study transfer of policies that work well for a simulated Fetch robot onto a real Fetch robot. To calibrate performance, we consider as a baseline a PD controller tuned for our Fetch robot.
II Related Work
Simulation has been an invaluable tool in advancing the development of robotics and many simulation techniques have been developed over the years. Reduced coordinate rigid multibody dynamics are especially suited for simulating articulated robots [9]. Unfortunately, many significant physical effects may not be possible to model with such simulation approaches. Flexible or inflatable bodies [35] [13], area contact [12], and interaction with fluids [34] [31] are just a few such examples. More accurate simulators, such as those based on the Finite Element Method [14], can be used to more closely match such real world effects, but they can be extremely computationally intensive (requiring days to compute seconds of simulation) and furthermore can be numerically ill-conditioned, which makes them difficult to use within numerical trajectory or policy optimization methods. Our method allows the use of simple, high-performance, and numerically smooth rigid body simulators (we use MuJoCo [38]) for policy or trajectory optimization, while still being able to adapt to complex effects present in the real world.
Even if a simulator were capable of modeling all the physical effects of interest, it would still require detailed and accurate model parameters (such as mass distributions, material properties, etc.). A significant body of research has focused on identifying these parameters from observations of robots’ behavior in the real world, but these methods tend to require separate specialized identification approaches and models for different robot platforms, such as legged robots [23], helicopters [26], or fixed-wing UAVs [18]. Furthermore, individual physical effects also require specialized expert-designed models and parameter identification methods, such as motor backlash [17], hydraulic actuation [6], series elastic actuation [30], or pneumatic actuation [36]. Our learned deep inverse dynamics models are based on past histories of observed states and in principle have the ability to model the above effects and platforms in one simple unified method without requiring any domain-specific manual model design and identification.
To remove the need for explicit dynamics, learning of dynamics models has received much attention in recent years. A number of approaches learn forward dynamics models: functions mapping current state and action to a next state [24] [32]. Such functions can then be used to solve for actions that lead to a desired next state. Alternatively, inverse dynamics models learn a mapping from current and next state to an action that achieves the transition between the two [29], [4], [25]. Such models are appealing because their output can be directly used for control, and they are the model type we use in this work. The data for model learning is typically gathered in a batch fashion, either from random trajectories, or from representative demonstrations. This can be problematic if the robot state trajectories resulting from policy execution do not match the model training trajectories. An alternative is to learn dynamics models in an online fashion, constantly adapting the model based on an incoming stream of observed states and actions [11] [28] [43] [22]. These approaches, however, are slow to adapt to rapidly-changing dynamics modes (such as those arising when making or breaking contact) and may be problematic when applied on robots performing rapid motions. Another alternative is to iteratively intertwine data collection and dynamics model learning [7] [10]. Such approaches concentrate training data in the regions of the state space that are relevant for task completion and inspire the data collection procedure in our work.
A number of options are available for representation of learned dynamics functions, from linear functions [28] [43], to Gaussian processes [2] [19] [7], to deep neural networks [32] [11]. Linear functions are very efficient to evaluate and solve controls for, but have limited expressive power. Gaussian Processes are able to provide model uncertainty estimates, but are problematic to scale to large dataset sizes. Deep neural networks are an expressive class of functions independent of dataset size and are what we use in this work.
Our approach to transfer between simulator and the real world is based on adapting actions. There is a rich body of work focusing on adapting policies, rather than actions in the context of reinforcement learning [37] [1] [5]. Another alternative is to consider robust control methods in simulation that produce policies that are robust to mismatch between simulator and the real world [44] [27]. In addition to actions, adaptation of states and observations between simulation and the real world is another challenging problem [41] [16] [40] [8]. In the current work, we choose to focus solely on adaptation of actions and leave other types of adaptation for future work.
III Method
III-A Setting
We study transfer from a source environment to a target environment. Typically the source environment would be a simulator, and the target environment would be a physical robot. However, in order to validate our method we start by using a simulator as both the source and the target domain. This setup has merit in developing an experimental understanding of our approach, as we can control the degree of variation between source and target environments. Our final experiments are in transfer from a simulator to the physical robot.
For each environment we denote the state space by $\mathcal{S}$, the action space by $\mathcal{A}$, and the observation space by $\mathcal{O}$. Points $s \in \mathcal{S}$, $a \in \mathcal{A}$, $o \in \mathcal{O}$ are states, actions, and observations. The state is not assumed observed. Overloading notation slightly, the agent makes noisy and incomplete observations of the underlying system, $o = o(s)$, which typically don’t expose some latent factors (e.g., fluctuating temperature or motor backlash). The special situation where the state is observed is readily captured by having the observation function $o(s) = s$. The system forward dynamics are given by a function from state-action pair to a new state: $s_{t+1} = f(s_t, a_t)$.
We use subscripts to explicitly distinguish between the source environment and the target environment. For example, $\mathcal{A}_{\text{source}}$ denotes the action space in the source environment, and $\mathcal{A}_{\text{target}}$ denotes the action space in the target environment.
A trajectory is a sequence of observations and actions: $\tau = (o_0, a_0, o_1, a_1, \ldots)$. We write $\tau_{i:j}$ to refer to the subsequence $(o_i, a_i, \ldots, o_j)$. We write $\tau_{-k:}$ to refer to the $k$ most recent observations and actions in a trajectory, and $o_{-1}$ to refer to the most recent observation.
A policy $\pi$ is a mapping from observations to actions that depends on the last $k$ observations, prescribing $a = \pi(\tau_{-k:})$. Our goal is to find a policy $\pi_{\text{target}}$ that performs well in the target environment.
Rather than learning a policy for the target environment from scratch, we assume that we have access to a competent policy $\pi_{\text{source}}$ in the source environment. Such a policy could be obtained through any of a variety of methods, including motion planning, model-predictive or optimization-based control, reinforcement learning, etc. Our approach is agnostic to how the policy was obtained.
III-B Transfer to the target environment
Rather than directly executing $\pi_{\text{source}}$ in the target environment, we seek to transfer the high-level properties of $\pi_{\text{source}}$ to be reused in the target environment, but not its lower-level specifics. Our approach is illustrated in Figure 1. During execution, we repeat the following at every time instant: consider the recent history of observations $\tau_{-k:}$, and compute the action $a_{\text{source}} = \pi_{\text{source}}(\tau_{-k:})$ which our source policy prescribes for the source environment. Simulate what observation $\hat{o}_{\text{next}}$ would be attained at the next time step in the source environment, and then compute $a_{\text{target}} = \phi(\tau_{-k:}, \hat{o}_{\text{next}})$. Here $\phi$ is a learned inverse dynamics model for the target environment, which takes in the recent history of actions and observations, as well as the desired next observation, and produces the action in the target domain that leads as close as possible to the desired observation $\hat{o}_{\text{next}}$.
Putting this all together, we have:
$$\pi_{\text{target}}(\tau_{-k:}) = \phi(\tau_{-k:}, \hat{o}_{\text{next}}),$$
where $\hat{o}_{\text{next}}$ is the observation the simulator predicts after applying $\pi_{\text{source}}(\tau_{-k:})$.
To be able to execute this approach, we assume that the simulator provides a forward dynamics model that allows us to compute a reasonable estimate of the next state $\hat{s}_{\text{next}}$ and observation $\hat{o}_{\text{next}}$.
If the learned inverse dynamics model $\phi$ is sufficiently accurate, then the next observation after taking action $a_{\text{target}}$ will be similar to $\hat{o}_{\text{next}}$.
For this approach to be meaningful, it is assumed that source and target environments have the same actuated degrees of freedom. However, the actions taken by policies $\pi_{\text{source}}$ and $\pi_{\text{target}}$ may be very different from each other. For example, the actuators may be calibrated differently, or realistic actuators may have complex dynamics like fluctuating temperature or gear backlash, which are not modeled in simulation. The dimensionality of the action space may even be different, for example when the target domain actions may be over biarticular pairs of antagonistic cables or muscle tendons, as in [21]. We have such flexibility in our method because the actions generated by the policy $\pi_{\text{source}}$ are never directly used in the target space, but only through mediation of the simulator and the anticipated next observation.
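The execution loop described in this subsection can be sketched on a toy problem. This is a minimal illustration, not the implementation used in the paper: the 1-D point mass, the two actuator gains, the PD-like source policy, and the analytic `inverse_dynamics` function (standing in for a learned model) are all assumptions made for the example.

```python
import numpy as np

DT = 0.1  # time step of the toy system (assumed)

def step(state, action, gain):
    """Toy forward dynamics; `gain` scales the actuator and differs per domain."""
    pos, vel = state
    vel = vel + DT * gain * action
    pos = pos + DT * vel
    return np.array([pos, vel])

SOURCE_GAIN, TARGET_GAIN = 1.0, 0.5  # hypothetical simulator vs. target actuators

def source_policy(obs, goal=1.0):
    """Hypothetical policy that performs well in the source domain."""
    return 4.0 * (goal - obs[0]) - 3.0 * obs[1]

def inverse_dynamics(obs, desired_next_obs):
    """Stand-in for the learned model: exact inverse of the toy target dynamics."""
    return (desired_next_obs[1] - obs[1]) / (DT * TARGET_GAIN)

def target_policy(obs):
    a_source = source_policy(obs)                    # action the source policy prescribes
    o_next_hat = step(obs, a_source, SOURCE_GAIN)    # observation the simulator expects next
    return inverse_dynamics(obs, o_next_hat)         # target action that reaches it

def rollout(policy, gain, steps=100):
    state = np.zeros(2)
    for _ in range(steps):
        state = step(state, policy(state), gain)
    return state

sim_final = rollout(source_policy, SOURCE_GAIN)      # source policy in the source domain
adapted_final = rollout(target_policy, TARGET_GAIN)  # adapted policy in the target domain
```

Because the stand-in inverse model is exact here, the adapted rollout in the target domain retraces the simulated rollout even though the actions sent to the two systems differ by a factor of two.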
III-C Training of the inverse dynamics model
We propose to collect trajectories in the physical environment, and to train a neural network that represents the inverse dynamics model, i.e., that can (approximately) predict the action that will lead to the next observation. For a snippet of a trajectory $\tau_{-k:}$ and next observation $o_{\text{next}}$, we train a neural network $\phi$ to predict the preceding action $a$:
$$\phi(\tau_{-k:}, o_{\text{next}}) \approx a.$$
We incorporate history in our model and pick the history window parameter $k$ to be large enough that $\phi$ can (implicitly) infer any important latent factors or temporal dependencies present in the dynamics.
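The construction of training examples from a recorded trajectory can be sketched as follows. This is a hedged illustration: `make_training_pairs`, the toy dynamics with a one-step actuator lag, and the use of linear least squares in place of a neural network are assumptions chosen so the example stays small and checkable.

```python
import numpy as np

def make_training_pairs(observations, actions, k):
    """Turn one trajectory into (input, label) pairs for inverse dynamics.

    observations: array (T + 1, obs_dim); actions: array (T, act_dim).
    Input at time t: the last k observations, the k - 1 actions between them,
    and the next observation o_{t+1}. Label: the action a_t.
    """
    inputs, labels = [], []
    for t in range(k - 1, len(actions)):
        obs_hist = observations[t - k + 1 : t + 1].ravel()
        act_hist = actions[t - k + 1 : t].ravel()
        inputs.append(np.concatenate([obs_hist, act_hist, observations[t + 1]]))
        labels.append(actions[t])
    return np.array(inputs), np.array(labels)

# Toy target system (assumed): the previous action leaks into the next step,
# a latent effect that cannot be inverted without the action history.
rng = np.random.default_rng(0)
T, k = 200, 2
actions = rng.normal(size=(T, 1))
observations = np.zeros((T + 1, 1))
for t in range(T):
    lagged = actions[t - 1] if t > 0 else 0.0
    observations[t + 1] = observations[t] + 0.5 * actions[t] + 0.2 * lagged

X, y = make_training_pairs(observations, actions, k)
# Linear least squares stands in for neural network training in this sketch.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
max_error = np.abs(X @ weights - y).max()
```

With a window of `k = 2` the previous action is part of the input, so the fitted model recovers the action exactly on this toy system; with `k = 1` the lag term would be unobserved.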
III-D Data collection / Exploration
At each point during training we have a preliminary inverse dynamics model $\phi$, which we can use to implement a preliminary policy $\pi_{\text{target}}$. In order to collect training data for our model, we execute this preliminary policy $\pi_{\text{target}}$. We add noise to the prescribed actions for exploration, i.e., in order to ensure that we have sufficiently diverse training data. Adding too much noise will result in data collected too far from the target trajectories; adding too little noise will result in insufficient exploration, and the inverse dynamics model will improve very slowly. In our experiments we describe our noise settings. We found it helpful to not add noise at every time step. Adding noise too frequently steers the data collection too far away from the relevant parts of the space for the task at hand. In simulation we can collect training samples very efficiently by setting the simulator to the states that occur along a trajectory; in a physical system, the efficiency of collecting training data depends on the amount of noise that can be injected into the controls before the robot moves far enough from the target trajectories that its behavior is no longer useful for training. We also found it more efficient to reset once the target execution starts deviating very far from what would have happened in the source environment.
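One way to organize such a data collection episode is sketched below. The function name, the noise probability, the noise scale, and the deviation threshold are illustrative assumptions, not the settings used in the experiments; the toy environment and the deliberately inaccurate preliminary policy are likewise hypothetical.

```python
import numpy as np

def collect_episode(reference, policy, env_step, rng,
                    noise_std=0.5, noise_prob=0.2, max_deviation=1.0):
    """Roll out a preliminary policy along a reference (simulated) trajectory.

    Noise is added only on a random subset of steps, and the episode is cut
    short once the observation drifts too far from the reference.
    Returns a list of (obs, action, next_obs) samples.
    """
    samples = []
    obs = reference[0].copy()
    for t in range(len(reference) - 1):
        action = policy(obs, reference[t + 1])
        if rng.random() < noise_prob:  # explore only occasionally
            action = action + rng.normal(scale=noise_std, size=np.shape(action))
        next_obs = env_step(obs, action)
        samples.append((obs, action, next_obs))
        if np.linalg.norm(next_obs - reference[t + 1]) > max_deviation:
            break  # too far from the target trajectory: stop and reset
        obs = next_obs
    return samples

def env_step(obs, action):
    """Toy target environment (assumed): actions have half the expected effect."""
    return obs + 0.5 * action

def preliminary_policy(obs, desired_next_obs):
    """Poor initial inverse model: ignores the halved actuator gain."""
    return desired_next_obs - obs

reference = np.linspace(0.0, 5.0, 51)[:, None]  # simulated target trajectory
samples = collect_episode(reference, preliminary_policy, env_step,
                          np.random.default_rng(0))
```

Each returned sample is a valid transition of the target system regardless of how poor the preliminary policy is, which is what makes the data usable for improving the inverse dynamics model.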
III-E Inverse dynamics neural network architecture
All of our inverse dynamics models take as input a sequence of previous observations, previous actions, and a target observation. Observations and actions are concatenated into one large input vector for the neural net. As is common in current neural net learning practice, the neural network inputs are normalized to have zero mean and unit variance [20]. We then apply a sequence of two fully-connected hidden layers of equal width with ReLU activations, followed by a fully-connected output layer, which gives the action $a$.
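A forward pass through a network of this shape can be sketched in NumPy. This is only an illustration of the described structure: the hidden width of 64, the weight initialization, and the example dimensions are placeholders chosen for the sketch, and no training code is shown.

```python
import numpy as np

def init_mlp(in_dim, out_dim, hidden=64, seed=0):
    """Two hidden layers plus a linear output layer; hidden=64 is a placeholder."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim, hidden, hidden, out_dim]
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def inverse_dynamics_net(params, obs_hist, act_hist, target_obs, mean, std):
    """Concatenate inputs, normalize, apply ReLU hidden layers, output an action."""
    x = np.concatenate([np.ravel(obs_hist), np.ravel(act_hist), np.ravel(target_obs)])
    x = (x - mean) / std                    # per-feature normalization
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)      # fully-connected layer + ReLU
    w, b = params[-1]
    return x @ w + b                        # linear output layer gives the action

# Example dimensions (assumed): 3 past observations of size 11, 2 past actions
# of size 2, and one target observation of size 11.
obs_hist, act_hist, target_obs = np.zeros((3, 11)), np.zeros((2, 2)), np.ones(11)
in_dim = obs_hist.size + act_hist.size + target_obs.size
params = init_mlp(in_dim, out_dim=2)
action = inverse_dynamics_net(params, obs_hist, act_hist, target_obs,
                              mean=np.zeros(in_dim), std=np.ones(in_dim))
```

In practice the `mean` and `std` vectors would be estimated from the collected training inputs rather than fixed as they are here.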
IV Experiments
The purpose of our method is to adapt a policy from a source environment to a target environment, with the key application being adaptation from simulation to real world. First, we measure adaptation capability between two simulators (Section IV-A), as this allows us to quantify most directly the differences between source environment and target environment. Then, we present results for adaptation from a simulation to a physical environment.
IV-A Simulated Environments – Sim1 to Sim2 transfer
We test our method on several simulated models in the robotics simulator MuJoCo [38] using OpenAI Gym environments [3]. Therefore, both source and target environments are in simulation. We perform experiments on the following standard OpenAI Gym environments (Figure 2). In each case, the observation space consists of positions and velocities of all degrees of freedom.

Reacher. Two-link arm aiming toward a target location, with an 11-dimensional observation space and 2 actuators. Arm end effector and target are included in the observation. (Footnote 1: We modify the Gym environment by increasing the mass of the arm to be roughly in line with the physical Fetch robot. This has a minimal effect on the original task, but it becomes relevant when we try to adapt to a modified version of the task with different gravity.)

Hopper. Two-dimensional model of a robot with a single “foot” that moves by hopping, with a 12-dimensional observation space and 3 actuators.

HalfCheetah. Two-dimensional model of a bipedal robot with a 17-dimensional observation space and 6 actuators.

Humanoid. Three-dimensional model of a humanoid robot with a 376-dimensional observation space and 17 actuators.
In each environment, we train our models to imitate an “expert policy”. The expert policies are obtained from Trust Region Policy Optimization [33] (source code by Ho et al. [15]). We measure the performance of policies using a reward given by OpenAI Gym [3]. (Footnote 2: These reward functions feature penalties for applying large torques; we remove these penalties, because they make it more difficult to interpret results which require gravity compensation or for which there is motor noise.) We normalize the performance measurement relative to the performance of the expert policy.
Note that our algorithm never observes the performance of the adapted policy. This is important for our intended application; evaluating the performance of adapted policies operating in the real world is typically more expensive than executing those policies, as it might, for example, require instrumentation of the physical world with ground truth sensors. We only use the performance measures to determine whether our method has successfully adapted the critical features of the expert policy.
To produce training data, we interleave learning with execution in the target domain, executing the previous estimate of the inverse dynamics model to generate trajectories to be used for further training. We interrupt trajectories at a random point in order to take a random action, and train the model to predict the random action from the resulting state (as well as the history of recent states and actions). We report all of our training times in terms of the number of training samples that we collect. In the case when the inverse dynamics model includes history (as described in Section III-C), we use the same window size for all the experiments.
We compare our approach to several popular methods that have been developed to deal with simulation to real world model discrepancy. The baselines we use are:

Expert Policy. We perform no adaptation and directly use the actions of the policy obtained from the source domain in the new target domain: $\pi_{\text{target}} = \pi_{\text{source}}$.

Output Error Control. We perform Model Predictive Control in the target domain using an adapted version of a dynamics model transferred from the source domain. At each timestep, we use the current observation and previous action to update the dynamics model, and use the updated dynamics model to compute a policy using iterative LQR [39]. The Output Error Control dynamics adaptation scheme adjusts the source dynamics model by an error term representing a decayed version of the model's prediction error in the target domain.

Gaussian Dynamics Adaptation. As in the previous baseline, we perform Model Predictive Control using iterative LQR on an adjusted dynamics model. The adjustment scheme in this case uses the source dynamics model to form a local Gaussian prior. We update this prior according to the empirical mean and covariance of the data observed in the target domain, and condition it to form the adapted local dynamics model used for planning. This is the approach proposed and described in more detail in [11].
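The conditioning step in the Gaussian Dynamics Adaptation baseline can be illustrated with the standard formula for conditioning a joint Gaussian. This sketch covers only that step: it fits the Gaussian purely from sampled data and omits the prior from the source model described in [11]. The function name and the toy linear system are assumptions.

```python
import numpy as np

def condition_gaussian(mean, cov, n_in):
    """Condition a joint Gaussian over [x, y] on x, giving y = F x + f with covariance S."""
    mu_x, mu_y = mean[:n_in], mean[n_in:]
    cov_xx, cov_xy = cov[:n_in, :n_in], cov[:n_in, n_in:]
    cov_yy = cov[n_in:, n_in:]
    F = np.linalg.solve(cov_xx, cov_xy).T   # Sigma_yx Sigma_xx^{-1}
    f = mu_y - F @ mu_x
    S = cov_yy - F @ cov_xy                 # conditional covariance
    return F, f, S

# Toy data (assumed): x = [state, action], y = next state from a linear system.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
y = 0.9 * x[:, :1] + 0.5 * x[:, 1:] + 0.1
data = np.hstack([x, y])

F, f, S = condition_gaussian(data.mean(axis=0), np.cov(data, rowvar=False), n_in=2)
```

The result is a locally linear dynamics model of the kind iterative LQR can plan with; on this noise-free toy data it recovers the generating coefficients.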
To test the capability of our method compared to the baseline methods, we consider the following two challenging differences between domains:

Variation in Gravity. Target environment has a difference in gravity from the source environment. Gravity differs in magnitude by for locomotion tasks. The Reacher task occurs in a plane; the expert policy is trained in a horizontal plane and essentially unaffected by gravity, and we test on planes that are rotated from to . On the Reacher task, our method is able to adapt successfully to this significant dynamics change.

Motor Noise. Before an action $a$ is sent to the robot, it is perturbed by adding a noise term $\epsilon$ to obtain $a + \epsilon$. We experiment with two variants, where this noise is independent on each time step, as well as where this noise varies slowly and is correlated over time. Such correlated noise is more representative of fluctuating environmental conditions, or latent physical effects like temperature changes.
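The two noise variants can be generated with a first-order autoregressive process. This is an assumed noise model chosen for illustration; the text does not specify how the correlated noise is produced, and `motor_noise` and its parameters are hypothetical names.

```python
import numpy as np

def motor_noise(steps, act_dim, std, correlation, rng):
    """Sample a noise sequence with stationary standard deviation `std`.

    correlation = 0 gives independent noise at every step; values near 1 give
    slowly varying noise; correlation = 1 gives one constant offset per episode.
    """
    noise = np.empty((steps, act_dim))
    noise[0] = rng.normal(scale=std, size=act_dim)
    fresh_scale = std * np.sqrt(1.0 - correlation ** 2)
    for t in range(1, steps):
        noise[t] = correlation * noise[t - 1] + rng.normal(scale=fresh_scale, size=act_dim)
    return noise

rng = np.random.default_rng(0)
independent = motor_noise(20000, 1, std=0.2, correlation=0.0, rng=rng)
slow = motor_noise(20000, 1, std=0.2, correlation=0.9, rng=rng)
constant = motor_noise(50, 2, std=0.2, correlation=1.0, rng=rng)

# The perturbed action sent to the robot is then a + noise[t].
lag1 = np.corrcoef(slow[:-1, 0], slow[1:, 0])[0, 1]
```

Scaling the fresh noise by `sqrt(1 - correlation**2)` keeps the overall standard deviation the same across correlation settings, so the two parameters can be varied independently.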
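In many cases, only small corrections to the source domain actions are necessary to adapt to the target domain. In such a setting, it may be beneficial for $\phi$ to output a correction term rather than an action directly:
$$\pi_{\text{target}}(\tau_{-k:}) = \pi_{\text{source}}(\tau_{-k:}) + \phi(\tau_{-k:}, \hat{o}_{\text{next}}).$$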
This has the downside of directly requiring actions from the source domain, but tends to result in better performance when the domains are similar. We use such a correction formulation for motor noise standard deviations below 0.3 and for all locomotion experiments with varying gravity. In such cases, we also found it most helpful to pretrain the model on trajectories produced by the expert policy.
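The difference between the direct and the correction formulation can be made concrete with a small sketch. The function names, the placeholder inputs, and the zero-output `untrained_phi` model are hypothetical and serve only to show how the two variants behave before any training.

```python
import numpy as np

def direct_action(phi, history, o_next_hat, a_source):
    """phi outputs the target action itself; a_source is not needed."""
    return phi(history, o_next_hat)

def corrected_action(phi, history, o_next_hat, a_source):
    """phi outputs only a correction that is added to the source action."""
    return a_source + phi(history, o_next_hat)

def untrained_phi(history, o_next_hat):
    """Hypothetical freshly initialized model whose output is near zero."""
    return np.zeros(2)

history, o_next_hat = np.zeros(8), np.zeros(4)  # placeholder inputs
a_source = np.array([0.3, -0.1])
a_direct = direct_action(untrained_phi, history, o_next_hat, a_source)
a_corrected = corrected_action(untrained_phi, history, o_next_hat, a_source)
```

With the correction formulation, a model that has learned nothing yet falls back to the source action, which is a sensible starting point when the two domains are similar.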
As expected, simply applying actions from an expert policy from a source domain results in poor performance on the target domain. Baselines that perform planning using a locally Gaussian forward dynamics model that is adapted online performed well with no additional training on the target domain in environments with simple dynamics (e.g., no contacts) such as Reacher and relatively slowly changing variation between the source and target domain. However, we found these methods to be ineffective in contact-rich environments such as Hopper, Cheetah, and Humanoid, even in the source domain. Contacts induce discontinuities that cause methods using locally linear dynamics approximations to perform poorly. Unstable tasks like Hopper and Humanoid are particularly poorly suited for these methods because small errors propagate over long trajectories, leading to episode termination.
Our method is also able to correct for slowly-varying noise and small changes to system dynamics. Moreover, it is able to adapt even in the presence of contact discontinuities that are extremely challenging for approaches based on Model Predictive Control. Such approaches require solving an optimization problem (iterative LQG) that can exploit the learned forward dynamics model and take it outside the regime it was trained on. By learning an inverse dynamics model, we simply take the output of such models and avoid performing potentially unstable numerical optimization.
Number of training samples in thousands (smaller is better)
Noise std.                  | none | 0.2 | 0.2 | 0.2 | 1.0 | 1.0 | 1.0
Noise correlation           | none | 0.0 | 0.9 | 1.0 | 0.0 | 0.9 | 1.0
Hopper
Adaptation without history  | 31   | 58  | 48  | 77  | 150 | 157 | 137
Adaptation with history     | 24   | 31  | 29  | 28  | 70  | 121 | 47
Learning from scratch       | about 1000
Humanoid
Adaptation without history  | 13   | 15  | 20  | 16  | N/A | N/A | 155
Adaptation with history     | 16   | 17  | 19  | 16  | N/A | N/A | 54
Learning from scratch       | about 70000
IV-B Physical interaction – Sim to Real transfer
We test our method on transferring trajectories from a simulated source domain to the target domain, which is a physical Fetch robot [42]. We control the robot using position control and stock firmware based on ROS at a frequency of 10 Hz.
The tasks consider control of the arm, and our metric measures the normalized distance between observations achieved in the simulator by the trajectory and observations achieved on the physical robot. The task is an agile back-and-forth swing of the arm where the middle of the arm is pulled by a bungee cord. Our action adaptation method is able to adjust to this condition by adapting and exerting the necessary amount of torque. As a baseline we use a PD controller with targets being the states experienced in the simulator. Table 5 summarizes our results.
Swings limited with a bungee cord  

Our method  
PD controller 
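The PD baseline with simulator states as targets can be sketched on a toy joint. Everything numeric here is an assumption for illustration: the gains, the unit-mass joint model, and the constant pull standing in for the bungee cord are not taken from the Fetch experiments.

```python
import numpy as np

def pd_control(q, qd, q_ref, qd_ref, kp=20.0, kd=5.0):
    """PD law: push toward the reference position and velocity (gains assumed)."""
    return kp * (q_ref - q) + kd * (qd_ref - qd)

def track(reference, dt=0.1, pull=-2.0):
    """Track simulator states on a toy unit-mass joint under a constant pull."""
    q, qd, visited = reference[0, 0], reference[0, 1], []
    for q_ref, qd_ref in reference[1:]:
        torque = pd_control(q, qd, q_ref, qd_ref)
        qd = qd + dt * (torque + pull)   # the pull is not modeled by the controller
        q = q + dt * qd
        visited.append((q, qd))
    return np.array(visited)

# Reference states from the simulator: hold position 1.0 at zero velocity.
reference = np.tile([1.0, 0.0], (101, 1))
visited = track(reference)
final_offset = reference[-1, 0] - visited[-1, 0]
```

On this toy system the PD controller settles with a steady offset from the simulated target, since a pure PD law has no mechanism to learn and cancel an unmodeled constant force.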
V Discussion and Future Work
We have presented a general method to adapt actions of policies developed in one domain such as simulation to a different domain such as the physical world. We achieve this by learning a deep inverse dynamics model that is trained on the behavior of the physical robot. Our method is successfully able to adapt complex control policies for aggressive reaching and locomotion on scenarios involving contact, hysteresis effects in the form of time-correlated noise, and significant differences between environments. However, to bring about robots that truly generalize in the physical world, in addition to action adaptation it is necessary to also adapt states and observations between simulation and the physical world. We currently assume observations generated by our simulator match closely to physical observations, which is reasonable when considering sensors such as joint positions, but it is not reasonable to expect simulated visual or depth sensors to match the high fidelity of the real world. This work only focused on action adaptation. In the future we plan to experiment with observation adaptation methods, such as [41] for instance. Additionally, our approach can be applied to a setting where we do not even observe the actions taken in the source domain. This presents exciting future opportunities to apply our method to use solely observations in the source domain (such as driving dashboard camera recording, for example) to recover and adapt actions for a corresponding driving policy.
References
 [1] Samuel Barrett, Matt E. Taylor, and Peter Stone. Transfer learning for reinforcement learning on a physical robot. In Ninth International Conference on Autonomous Agents and Multiagent Systems - Adaptive Learning Agents Workshop (ALA), May 2010.
 [2] Joschka Boedecker, Jost Tobias Springenberg, Jan Wülfing, and Martin Riedmiller. Approximate real-time optimal control based on sparse Gaussian process models. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014.
 [3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 [4] R. Calandra, S. Ivaldi, M. Deisenroth, E. Rückert, and J. Peters. Learning inverse dynamics models with contacts. In IEEE International Conference on Robotics and Automation, pages 3186–3191, 2015.
 [5] Mark Cutler, Thomas J Walsh, and Jonathan P How. Real-world reinforcement learning via multi-fidelity simulators. IEEE Transactions on Robotics, 31(3):655–671, 2015.
 [6] Benoit Boulet and Laeeque Daneshmend. System identification and modelling of a high performance hydraulic actuator. In Lecture Notes in Control and Information Sciences. Springer Verlag, 1992.
 [7] Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, 2011.
 [8] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
 [9] Roy Featherstone. Rigid Body Dynamics Algorithms. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.
 [10] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. CoRR, abs/1603.00448, 2016.
 [11] Justin Fu, Sergey Levine, and Pieter Abbeel. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. CoRR, abs/1509.06841, 2015.
 [12] G. Gilardi and I. Sharf. Literature survey of contact dynamics modelling. Mechanism and Machine Theory, 37(10):1213 – 1239, 2002.
 [13] Abhishek Gupta, Clemens Eppner, Sergey Levine, and Pieter Abbeel. Learning dexterous manipulation for a soft robotic hand from human demonstration. CoRR, abs/1603.06348, 2016.
 [14] Karlsson Hibbitt and Sorensen. ABAQUS/CAE User’s Manual. Hibbitt, Karlsson & Sorensen, Incorporated, 2002.
 [15] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476, 2016.
 [16] Judy Hoffman, Erik Rodner, Jeff Donahue, Trevor Darrell, and Kate Saenko. Efficient learning of domain-invariant image representations. arXiv preprint arXiv:1301.3224, 2013.
 [17] GE Hovland, S Hanssen, E Gallestey, S Moberg, T Brogardh, S Gunnarsson, and M Isaksson. Nonlinear identification of backlash in robot transmissions. In Proceedings of the 33rd ISR (International Symposium on Robotics), 2002.
 [18] Thomas Ingebretsen. System identification of unmanned aerial vehicles. 2012.
 [19] Jonathan Ko and Dieter Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Auton. Robots, 27(1):75–90, 2009.
 [20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 [21] Yoonsang Lee, Moon Seok Park, Taesoo Kwon, and Jehee Lee. Locomotion control for many-muscle humanoids. ACM Trans. Graph., 33(6):218:1–218:11, November 2014.
 [22] Ian Lenz, Ross Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In RSS, 2015.
 [23] Michael Yurievich Levashov. Modeling, System Identification, and Control for Dynamic Locomotion of the LittleDog Robot on Rough Terrain. PhD thesis, Citeseer, 2012.
 [24] L. Ljung. System Identification: Theory for the User. Prentice Hall information and system sciences series. Prentice Hall PTR, 1999.
 [25] Franziska Meier, Daniel Kappler, Nathan Ratliff, and Stefan Schaal. Towards robust online inverse dynamics learning. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems. IEEE, 2016.
 [26] Bernard Mettler, Mark B. Tischler, and Takeo Kanade. System identification of small-size unmanned helicopter dynamics. Presented at the American Helicopter Society 55th Forum, May 1999.
 [27] Igor Mordatch, Kendall Lowrey, and Emanuel Todorov. Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 5307–5314. IEEE, 2015.
 [28] Igor Mordatch, Nikhil Mishra, Clemens Eppner, and Pieter Abbeel. Combining model-based policy search with online model learning for control of physical humanoids. In Proceedings of the IEEE International Conference on Robotics and Automation, 2016.
 [29] D. Nguyen-Tuong and J. Peters. Using model knowledge for learning inverse dynamics. pages 2677–2682, Piscataway, NJ, USA, May 2010. Max-Planck-Gesellschaft, IEEE.
 [30] Nicholas Paine, Joshua S. Mehling, James Holley, Nicolaus A. Radford, Gwendolyn Johnson, Chien-Liang Fok, and Luis Sentis. Actuator control for the NASA-JSC Valkyrie humanoid robot: A decoupled dynamics approach for torque control of series elastic robots. Journal of Field Robotics, 32(3):378–396, 2015.
 [31] Zherong Pan, Chonhyon Park, and Dinesh Manocha. Robot motion planning for pouring liquids. In Proceedings of the Twenty-Sixth International Conference on Automated Planning and Scheduling, ICAPS 2016, London, UK, June 12–17, 2016, pages 518–526, 2016.
 [32] Ali Punjani and Pieter Abbeel. Deep learning helicopter dynamics models. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3223–3230. IEEE, 2015.
 [33] John Schulman, Sergey Levine, Philipp Moritz, Michael I Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
 [34] Jie Tan, Yuting Gu, Greg Turk, and C. Karen Liu. Articulated swimming creatures. In ACM SIGGRAPH 2011 papers, SIGGRAPH ’11, pages 58:1–58:12. ACM, 2011.
 [35] Jie Tan, Greg Turk, and C. Karen Liu. Soft body locomotion. ACM Trans. Graph., 31(4):26:1–26:11, 2012.
 [36] Yuval Tassa, Tingfan Wu, Javier Movellan, and Emanuel Todorov. Modeling and identification of pneumatic actuators. In 2013 IEEE International Conference on Mechatronics and Automation, pages 437–443. IEEE, 2013.
 [37] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
 [38] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
 [39] Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, 2005. Proceedings of the 2005, pages 300–306 vol. 1. IEEE, June 2005.
 [40] Eric Tzeng, Coline Devin, Judy Hoffman, Chelsea Finn, Xingchao Peng, Sergey Levine, Kate Saenko, and Trevor Darrell. Towards adapting deep visuomotor representations from simulated to real environments. CoRR, abs/1511.07111, 2015.
 [41] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. CoRR, abs/1510.02192, 2015.
 [42] Melonee Wise, Michael Ferguson, Derek King, Eric Diehr, and David Dymesich. Fetch and freight: Standard platforms for service robot applications. Workshop on Autonomous Mobile Service Robots, 2016.
 [43] Michael C. Yip and David B. Camarillo. Model-Less Feedback Control of Continuum Manipulators in Constrained Environments. IEEE Transactions on Robotics, 30(4):880–889, August 2014.
 [44] Kemin Zhou and John Comstock Doyle. Essentials of Robust Control, volume 104. Prentice Hall, Upper Saddle River, NJ, 1998.