Unlocking the Potential of Simulators: Design with RL in Mind


Using Reinforcement Learning (RL) in simulation to construct policies useful in real life is challenging. This is often attributed to the sequential decision making aspect: inaccuracies in simulation accumulate over multiple steps, hence the simulated trajectories diverge from what would happen in reality.

In our work we show the need to consider another important aspect: the mismatch in simulating control. We bring attention to the need to model control as well as dynamics, since oversimplified assumptions about how the actions of an RL policy are applied could make the policies fail on real-world systems.

We design a simulator for solving a pivoting task (of interest in Robotics) and demonstrate that even a simple simulator designed with RL in mind outperforms high-fidelity simulators when it comes to learning a policy that is to be deployed on a real robotic system. We show that a phenomenon that is hard to model – friction – could be exploited successfully, even when RL is performed using a simulator with a simple dynamics and noise model. Hence, we demonstrate that as long as the main sources of uncertainty are identified, it could be possible to learn policies applicable to real systems even using a simple simulator.

RL-compatible simulators could open up possibilities for applying a wide range of RL algorithms in various fields. This is important, since data sparsity in fields like healthcare and education currently forces researchers and engineers to consider only sample-efficient RL approaches. Successful simulator-aided RL could increase the flexibility of experimenting with RL algorithms and help apply RL policies to real-world settings in fields where data is scarce. We believe that lessons learned in Robotics could help other fields design RL-compatible simulators, so we summarize our experience and conclude with suggestions.


Reinforcement Learning in Robotics, Learning from Simulation


We thank Christian Smith and Danica Kragic for guidance in the “Pivoting Task” work, which was supported by the European Union framework program H2020-645403 RobDREAM.


1 Introduction

Using simulators to learn policies applicable to real-world settings is a challenge: important aspects of agent and environment dynamics might be hard and costly to model. Even if sufficient modeling knowledge and computational resources are available, inaccuracies present even in high-fidelity simulators can cause the learned policies to be useless in practice. In the case of sequential decision making even a small mismatch between the simulated and the real world could accumulate across multiple steps of executing a policy.

Despite these challenges, there is potential if simulators are designed with RL in mind. In our work we aim to understand what aspects of simulators are particularly important for learning RL policies. We focus first on the field of Robotics and present an approach to learn robust policies in a simulator, designed to account for uncertainty in the dynamics as well as in control of the real robot. We discuss our simulation and hardware experiments for solving a pivoting task (a robotic dexterity problem of re-orienting an object after initial grasp). Solving this task is of interest to the Robotics community; our approach to solving this task is described in detail in [1]. Here we only include the aspects of the work relevant to an interdisciplinary RLDM audience and present it as an example of RL-compatible simulator design.

We then discuss the potential of building simulators with RL in mind for other fields. In a number of fields enough domain knowledge is available to construct a task-specific or setting-specific simulator for a particular problem. It would be useful to ensure that such simulators can facilitate learning RL policies that work when transferred to real environments. This would significantly broaden the range of RL algorithms that could be used to solve a given problem – eliminating the need to restrict attention to only sample-efficient approaches (as is often done, since real-world learning is costly).

2 Why Use Simulators for Learning?

Recent successes of Reinforcement Learning in games (Atari, Go) and Robotics make RL a candidate for solving challenging decision-making-under-uncertainty problems in a wide variety of fields. A number of attempts to apply RL algorithms to real-world problems have been successful, but success has mostly been limited to sample-efficient approaches, since it is costly and time-consuming to search for optimal policies by obtaining real-world trajectories. Batch RL – a class of algorithms that learn only from previously collected samples – could be helpful, but it constitutes only a small fraction of the variety of RL algorithms. Hence, to open up the potential of a wider range of RL algorithms, we can turn to simulation.

A number of fields have already developed general-purpose simulators1, and task-specific simulators are frequently developed for solving concrete research problems. But there is a need to facilitate successful transfer of the policies learned in simulation to real-world environments. To achieve this we can either ensure a tight enough match between simulated and real environments and use the learned policies directly, or develop data-efficient “second-stage” approaches to adjust the learned policies to the real world.

An additional consideration is the current rapid advance in using deep neural networks for RL. Several recently successful algorithms, like Trust Region Policy Optimization (TRPO) [8] and Deep Deterministic Policy Gradient (DDPG) [6], are not designed to be particularly sample-efficient. Instead, they learn flexible control policies and can use large neural networks as powerful policy and Q function approximators. In simulation, “deep RL” algorithms can solve problems similar in principle to those considered, for example, in Robotics. However, significant practical problems remain before they can be routinely applied to the real world. The algorithms might not be data-efficient enough to learn on real hardware, so (pre-)training in simulation might be the only feasible solution. Success in simplified simulated settings does not immediately imply that the policies can be transferred to a real-world task “as-is”, but investigating the conditions necessary for successful transfer could enable approaches that succeed in real-world environments.

3 Challenges and Opportunities of Simulation-based Learning: Insights from Robotics

Robotics presents a great opportunity for research on simulation-to-reality transfer for RL agents. First of all, a wide variety of general and task-specific simulators have been built to simulate various robotic systems. These simulators can model certain aspects well (e.g. the dynamics of rigid bodies with known masses and shapes), while other aspects are harder to model precisely (e.g. the dynamics of non-rigid bodies, friction). Hence, so-called “high-fidelity” simulators contain a combination of models that are precise enough to describe real-world interactions and models that are imprecise or even misleading. This creates a challenging problem from the perspective of an uninformed RL agent. The simplifying aspect, however, is that gaining insight into RL algorithms is easier: experienced roboticists can make a reasonable guess as to which aspect of the environment is hardest for the agent to learn correctly from simulation.

This is in contrast with other fields. For example, some leading approaches for modeling students’ knowledge in the field of education have error rates not far from random guessing when applied to non-synthetic student data. The modeling problem is hard, but it is not at all clear which aspect is at fault and what exactly makes the real student data so challenging. Another example: modeling traffic in data networks requires significant simplifications for large-scale (internet-scale) simulators, which causes difficulties if an RL agent is learning a policy for handling individual packets. By comparison, robotic simulators give a clearer picture of what would or would not be modeled well, and there is a way to quickly test policies learned in simulation on the real environment. Such robotics experiments are frequently more accessible than running large-scale studies involving humans, where returns from the policy could be distant in time (e.g. the learning outcome of a class lasting several months) or too high-stakes for experimentation (the result of a treatment for a patient). Thus one strategy is to develop and test a variety of simulation-to-reality transfer approaches on robotic tasks, then use the promising approaches for costlier and longer-term studies in other fields.

We note that care should be taken to capture (even if only approximately) all the relevant environment and control dynamics in a simulator. Roboticists have looked into problems arising from under-modeling of dynamics and hardware wear-and-tear (e.g. [2]) and have experimented with various approaches to injecting noise into deterministic simulators to increase tolerance to slight modeling inaccuracies ([5] gives a summary and further references). Nonetheless, one aspect that is frequently overlooked is how control actions are applied to real systems. Simple systems built in the lab can be designed to support fast and direct control. However, as the field moves to more sophisticated robotic systems built for a variety of tasks, oversimplifying assumptions about control could render simulation useless. Consider, for example, the Baxter robot: a medium-to-low-cost system built for small businesses and research labs. This robot is designed to be relatively inexpensive, safe and versatile. Hence, in its usual operating mode, the control of the robot differs from that of high-cost industrial robots and from that of simpler custom-built research systems: speed and precision of control are traded off for lower component cost, safe and smooth motions, and versatility.

In our work, in addition to incorporating previously proposed solutions to policy transfer challenges, we also emphasize the need to model delays and inaccuracies in control. We envision that, as more multi-purpose medium-to-low-cost robots become widely used, modeling control delays and inaccuracies will become a standard component of building simulators. With that, we demonstrate that even a simple simulator designed with this in mind can be suitable for simulation-based RL.
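As a concrete illustration, delayed control can be added to a simulator with a simple wrapper that holds each commanded action in a queue for a few time steps before it takes effect. The sketch below is hypothetical (the `env` interface and names are assumptions, not taken from our implementation):

```python
from collections import deque

class DelayedControlEnv:
    """Wrap a simulated environment so that each action takes effect only
    after `delay_steps` simulation steps, mimicking the non-instantaneous
    control of robots like Baxter. (Hypothetical sketch: `env` is any object
    exposing a step(action) method.)"""

    def __init__(self, env, delay_steps, neutral_action):
        self.env = env
        # Pre-fill the queue so the first few steps execute a neutral action.
        self.pending = deque([neutral_action] * delay_steps,
                             maxlen=delay_steps + 1)

    def step(self, action):
        self.pending.append(action)        # the command issued now...
        delayed = self.pending.popleft()   # ...while an older one executes
        return self.env.step(delayed)
```

A policy trained against such a wrapper has to anticipate its own delayed effect, which is exactly the mismatch we observed between idealized simulators and the real robot.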

4 Building an RL-compatible Simulator for Solving a Pivoting Task

In this section we present a short summary of our research work described in [1], with an intent to give an example that is comprehensible to an interdisciplinary audience. We describe our approach to building a task-specific simulator for a pivoting task, then learning an RL policy in simulation and using it on the real robot2 (without further adjustments).

Figure 1: Pivoting task on Baxter robot.

The objective of a pivoting task is to pivot a tool to a desired angle while holding it in the gripper. This can be done by moving the arm of the robot to generate inertial forces sufficient to move the tool and, at the same time, opening or closing the gripper’s fingers to change the friction at the pivoting point (gaining more precise control of the motion). The robot uses a standard planning approach to grasp the tool at a random initial angle; the goal is then to pivot the tool to a desired target angle.

Previous approaches have notable limitations: open-loop control with no control of the gripping force [4]; movement restricted to the vertical plane [9]; a gripper held in a fixed position, so that the motion of the tool is determined only by the gravitational torque and the torsional friction [11]. All these prior approaches rely strongly on having an accurate model of the tool, as well as precise measurement and modeling of the friction. Since it is difficult to obtain these, we develop an alternative.

4.1 Dynamic and Friction Models

Our simulator system is composed of a parallel gripper attached to a link that can rotate around a single axis. This system is an under-actuated two-link planar arm, in which the under-actuated joint corresponds to the pivoting point. We assume that we can control the desired acceleration on the first joint. The dynamic model of the system is given by:

$(I + mr^2)(\ddot{\phi}_1 + \ddot{\phi}_2) + mlr\cos(\phi_2)\,\ddot{\phi}_1 + mlr\sin(\phi_2)\,\dot{\phi}_1^2 + mgr\cos(\phi_1 + \phi_2) = \Gamma_f$

where the variables are as follows: $\phi_1$ and $\phi_2$ are the angles of the first and second link respectively; $\ddot{\phi}_1$ and $\ddot{\phi}_2$ are their angular accelerations and $\dot{\phi}_1$ is the angular velocity of the first link; $l$ is the length of the first link; $I$ is the tool’s moment of inertia with respect to its center of mass; $m$ is the tool’s mass; $r$ is the distance of its center of mass from the pivoting point; $g$ is the gravity acceleration; $\Gamma_f$ is the torsional friction at the contact point between the gripper’s fingers and the tool. The second link represents the tool, and $\phi_2$ is the variable we aim to control. Figure 2 illustrates this model.

Figure 2: Model of a 2-link planar arm. First link is the gripper, second is the tool rotating around pivoting point.
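To make the model concrete, one explicit-Euler simulation step for these dynamics can be sketched as follows. The physical constants are illustrative placeholders (not the values identified for our tool), and the friction torque $\Gamma_f$ is supplied by the friction model described next:

```python
import math

# Illustrative physical constants (assumed, not the values used in the paper)
L = 0.3    # length of the first link [m]
I = 1e-3   # tool moment of inertia about its center of mass [kg m^2]
M = 0.1    # tool mass [kg]
R = 0.1    # distance of the tool center of mass from the pivoting point [m]
G = 9.81   # gravitational acceleration [m/s^2]

def tool_accel(phi1, phi2, dphi1, ddphi1, friction_torque):
    """Angular acceleration of the tool (second link), solved from the
    dynamic model: (I + m r^2)(ddphi1 + ddphi2) + coupling + gravity = Gamma_f."""
    inertia = I + M * R * R
    coupling = M * L * R * (math.cos(phi2) * ddphi1 + math.sin(phi2) * dphi1 ** 2)
    gravity = M * G * R * math.cos(phi1 + phi2)
    return (friction_torque - gravity - coupling) / inertia - ddphi1

def step(state, ddphi1, friction_torque, dt=0.01):
    """One explicit-Euler integration step of (phi1, dphi1, phi2, dphi2)."""
    phi1, dphi1, phi2, dphi2 = state
    ddphi2 = tool_accel(phi1, phi2, dphi1, ddphi1, friction_torque)
    return (phi1 + dphi1 * dt, dphi1 + ddphi1 * dt,
            phi2 + dphi2 * dt, dphi2 + ddphi2 * dt)
```

A fixed-step Euler integrator like this is the simplest choice; the noise injection described in Section 4.2 can then be layered on top of these deterministic updates.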

Pivoting exploits friction at the contact point between the gripper and the tool to control the rotational motion. Such friction is controlled by enlarging or tightening the grasp. When the tool is not moving ($\dot{\phi}_2 = 0$), the static friction is modeled according to the Coulomb friction model: $|\Gamma_f| \le \mu_s f_n$, where $\mu_s$ is the coefficient of static friction and $f_n$ is the normal force applied by the gripper’s fingers on the tool. When the tool moves with respect to the gripper, we model the friction torque as viscous friction and Coulomb friction [7]: $\Gamma_f = -\mu_v \dot{\phi}_2 - \mu_c f_n \,\mathrm{sgn}(\dot{\phi}_2)$, in which $\mu_v$ and $\mu_c$ are the viscous and Coulomb friction coefficients and $\mathrm{sgn}(\cdot)$ is the signum function. Since most robots are not equipped with tactile sensors to measure the normal force at the contact point, as in [10] we express it as a function of the distance $d$ between the fingers using a linear deformation model: $f_n = k(d_0 - d)$, where $k$ is a stiffness parameter and $d_0$ is the distance at which the fingers initiate contact with the tool.
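A sketch of this friction model in code, with illustrative (assumed) parameter values:

```python
import math

# Illustrative parameters (assumed, not estimated values from the paper)
MU_S = 0.4   # static friction coefficient
MU_V = 0.01  # viscous friction coefficient
MU_C = 0.3   # Coulomb friction coefficient
K = 1000.0   # finger stiffness [N/m]
D0 = 0.03    # finger distance at first contact with the tool [m]

def normal_force(d):
    """Linear deformation model: force grows as fingers close past contact."""
    return max(0.0, K * (D0 - d))

def friction_torque(dphi2, d, applied_torque):
    """Torsional friction at the pivoting point."""
    fn = normal_force(d)
    if dphi2 == 0.0:
        # Static regime: friction balances the applied torque up to mu_s * fn.
        limit = MU_S * fn
        return max(-limit, min(limit, applied_torque))
    # Kinetic regime: viscous + Coulomb friction opposing the motion.
    return -(MU_V * dphi2 + MU_C * fn * math.copysign(1.0, dphi2))
```

The gripper action only changes the finger distance `d`, so the policy controls friction indirectly through the normal force, exactly as in the task description above.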

4.2 Learning Robust Policies

We formulate the pivoting task as an MDP. The state space is comprised of the states observed at each time step $t$: $s_t = [\phi_1, \dot{\phi}_1, \phi_2, \dot{\phi}_2, d]$, with notation as in the previous subsection. The actions are $a_t = [\ddot{\phi}_1, a_d]$, where $\ddot{\phi}_1$ is the rotational acceleration of the robot arm and $a_d$ is the direction of the desired change in the distance between the fingers of the gripper. The state transition dynamics are implemented by the simulator, but the RL algorithm does not have explicit access to these. A reward is given at each time step such that higher rewards are given when the angle of the tool is closer to the target angle $\phi_{tgt}$: $r_t = 1 - |\phi_2 - \phi_{tgt}|/c$ ($c$ is a normalizing constant). A bonus reward of 1 is given when the tool is close to the target and stopped.
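For illustration, the per-step reward can be sketched as follows; the normalizing constant and the goal tolerances here are assumptions, not the task-specific values used in our experiments:

```python
import math

C = math.pi            # normalizing constant (assumed: angle errors stay within pi)
GOAL_TOL_ANGLE = 0.05  # "close to the target" tolerance [rad] (assumed)
GOAL_TOL_VEL = 0.01    # "stopped" tolerance [rad/s] (assumed)

def reward(phi2, dphi2, phi_target):
    """Dense reward: higher when the tool angle is closer to the target,
    plus a bonus of 1 when the tool is near the target and stopped."""
    r = 1.0 - abs(phi2 - phi_target) / C
    if abs(phi2 - phi_target) < GOAL_TOL_ANGLE and abs(dphi2) < GOAL_TOL_VEL:
        r += 1.0
    return r
```

The dense term gives the learner a gradient toward the target at every step, while the bonus encourages actually stopping there rather than oscillating through the goal region.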

We aim to learn from the simulated environment while being robust to discrepancies between the simulation and execution on the robot. For this purpose, we first built a simple custom simulator using the equations from Section 4.1. To facilitate learning policies robust to uncertainty, we added 10% noise to the friction values estimated for the tool modeled by the simulator. We also injected up to 10% randomized delay for arm and finger actions in simulation: we used a noisy linear increase in velocity (as the response to an acceleration action) and simulated changing the fingers’ distance with noisy steps.
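Per-episode perturbation of the simulator parameters can be sketched as follows (the nominal values and the helper names are illustrative assumptions):

```python
import random

FRICTION_NOISE = 0.10  # up to 10% perturbation of the estimated friction values
DELAY_NOISE = 0.10     # up to 10% randomized control delay

def noisy(value, rel_noise):
    """Perturb a nominal parameter by a uniform relative amount."""
    return value * (1.0 + random.uniform(-rel_noise, rel_noise))

def episode_params(mu_v, mu_c, base_delay):
    """Sample friction coefficients and a control delay for one episode."""
    return {
        "mu_v": noisy(mu_v, FRICTION_NOISE),
        "mu_c": noisy(mu_c, FRICTION_NOISE),
        # Arm and finger commands take effect after a randomized delay.
        "delay": base_delay * (1.0 + random.uniform(0.0, DELAY_NOISE)),
    }
```

Resampling these parameters at the start of each episode forces the policy to succeed across a whole neighborhood of the estimated model rather than at a single point.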

We then trained a model-free deep RL policy search algorithm TRPO [8] on our simulated setting. TRPO has been shown to be competitive with (and sometimes outperform) other recent continuous state and action RL algorithms [3]. However, to our knowledge it has not yet been widely applied to real-world robotics tasks. While the background for the algorithm is well-motivated theoretically, the approximations made for practicality, along with challenges in achieving reasonable training results with a small-to-medium number of samples, could impair the applicability of the algorithm to learning on the robot directly. Hence we explore the approach of using TRPO for policy search in a simulated environment.

Our model of delayed and noisy control matched the behavior of the real Baxter better than assuming near-instantaneous high-precision control. In contrast, learning from a high-fidelity simulator (V-REP) yielded policies that failed completely on the real robot: the tool swayed in random directions and was dropped. V-REP’s oversimplified simulation of control impaired the ability of RL to learn useful policies from what was considered a relatively high-fidelity simulation.

4.3 Summary of Experiments

Figure 3: Evaluation of TRPO training. Initial and target angles were chosen at random; each training iteration contained 50 episodes.

We trained TRPO (using the rllab [3] implementation as a starting point, with parameters as reported in [8] for simulated control tasks) with a fully connected network with two hidden layers (32 and 16 nodes) for the policy and Q function approximators. The motion was constrained to be linear in a horizontal plane. However, since the learning was not at all informed of the physics of the task, any other plane could be chosen and implemented by the simulator if needed.

To simulate the dynamics of the gripper arm and the tool, we used the modeling approach from Section 4.1 and injected noise during training to help learn policies robust to the mismatch between the simulator and the robot. While we allowed only up to 10% noise in the variable modeling Coulomb friction during training, the policy still retained a 40% success rate of reaching the target angle when tested in settings where this variable was changed by 250% to 500%. This level of robustness alleviates the need to estimate friction coefficients precisely.
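A robustness evaluation of this kind amounts to a sweep over friction scalings. The harness below is a hypothetical sketch: `run_episode` stands in for rolling out the learned policy in the simulator with the Coulomb friction coefficient scaled by the given factor.

```python
def success_rate(run_episode, friction_scales, episodes=100):
    """Estimate the fraction of successful episodes for each scaling of the
    Coulomb friction coefficient away from its estimated value.
    `run_episode(scale)` runs one simulated episode with mu_c multiplied by
    `scale` and returns True if the target angle is reached."""
    return {scale: sum(bool(run_episode(scale)) for _ in range(episodes)) / episodes
            for scale in friction_scales}
```

Plotting the resulting success rates against the scaling factor shows how far the learned policy can be pushed beyond the friction regime it was trained on.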

Figure 4: Distance to target vs time for experiments on Baxter averaged over all trials (30 trials for each tool).

We deployed the learned policies on a Baxter robot and ran experiments with two objects of different materials, hence different friction properties. The friction coefficients used for modeling the tool in the simulator have been estimated only for the first tool. Hence the policy was not explicitly trained using the parameters matching those of the second tool.

The policy achieved a higher success rate with tool 1 than with tool 2. As expected, the policy performs better with the tool whose parameters were used (in a noisy manner) during training in simulation. Nonetheless, the performance with the tool not seen during training was robust as well. Fig. 4 illustrates the performance averaged over all the trials. The target is reached faster when tool 1 is used, and a bit more slowly when tool 2 is used. The deviation from the goal region after reaching the goal is likely due to inaccuracies in visual tracking: after reporting that the tool is in the goal region, the tracker might later report a corrected estimate, indicating that further tool adjustment is needed. We observe that in such cases the policy still succeeds in further pivoting the tool to the target angle.

5 Conclusion & Implications for Designing RL-compatible Simulators in Other Fields

Our experience of building an RL-compatible simulator for a particular robotics task yields a few general suggestions.

1. Focused modeling of the dynamics: Since high-fidelity modeling could be expensive, one could consider modeling only the part of the dynamics relevant to the task. Domain expertise could be used to decide what is “relevant”. E.g. high-fidelity modeling of the robot arm dynamics might be unnecessary if the task can be solved by restricting the motion to a particular plane, where even simple dynamics equations suffice. In fields that already have general-purpose simulators such “focusing” is not always easy: the design usually does not allow making a general-purpose simulator more task-oriented. Even if a researcher can restrict the problem to a case where the dynamics should be simpler, there is frequently no option in the simulator to eliminate unnecessary computations. Hence, the suggestion is to keep this design consideration in mind when building general-purpose simulators: either give users easy control over which simulation modules to turn on/off, or run automatic internal optimization of the simulator.

2. Not oversimplifying the model of executing actions: Failing to model the process of applying an action could cause policies learned even in a high-fidelity simulator to fail on a real system. In contrast, including even a simple model of control application in the simulator could fix the problem. Another solution could be to re-formulate the MDP/POMDP for the task so that primitive actions are trivial to apply to the real system (capturing all the side-effects of applying a primitive action in the dynamics of the state transitions). This could work, though it might require unintuitive formulations. Regardless of the approach, the main insight is not to neglect this aspect altogether: control of real robots is not instantaneous; the action “give medication A” might not lead to the state “took medication A” on the patient’s side; the action “show advertisement B” might be blocked or ignored by the users.

3. Even simple noise models help if sources of variability/uncertainty are identified: Adding Gaussian noise to various parts of a deterministic dynamical system is an approach frequently used in robotics to “model” uncertainty. No doubt this could be too simplistic for some settings and tasks. However, since learning in simulation opens the potential for cheaper experimentation, it is possible to try various ways of capturing uncertainty and identify those sufficient for learning successful policies. As we show in our experiments, a phenomenon that is hard to model – friction – could still be exploited successfully for control, even with a simple dynamical model and a simple noise model. As long as the main sources of uncertainty are identified, it could be possible to learn policies applicable to real systems even from a simple simulator.


  1. Examples of general-purpose simulators from various fields: SimStudent in education (simstudent.org); NS network simulator used for research in data networks (en.wikipedia.org/wiki/Ns_(simulator)); more than a dozen general-purpose simulators for robotics (en.wikipedia.org/wiki/Robotics_simulator).
  2. Video summarizing hardware experiments is available at https://www.youtube.com/watch?v=LWSjYI9a9xw


  1. Rika Antonova, Silvia Cruciani, Christian Smith, and Danica Kragic. Reinforcement learning for pivoting task. arXiv preprint arXiv:1703.00472, 2017.
  2. Christopher G Atkeson et al. Using local trajectory optimizers to speed up global optimization in dynamic programming. Advances in neural information processing systems, pages 663–663, 1994.
  3. Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
  4. A. Holladay, R. Paolini, and M. T. Mason. A general framework for open-loop pivoting. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3675–3681, May 2015.
  5. Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, page 0278364913495721, 2013.
  6. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  7. H. Olsson, K. J. Åström, M. Gäfvert, C. Canudas De Wit, and P. Lischinsky. Friction models and friction compensation. Eur. J. Control, page 176, 1998.
  8. John Schulman, Sergey Levine, Philipp Moritz, Michael I Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
  9. A. Sintov and A. Shapiro. Swing-up regrasping algorithm using energy control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 4888–4893, May 2016.
  10. F. E. Viña, Y. Karayiannidis, K. Pauwels, C. Smith, and D. Kragic. In-hand manipulation using gravity and controlled slip. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 5636–5641, Sept 2015.
  11. F. E. Viña, Y. Karayiannidis, C. Smith, and D. Kragic. Adaptive control for pivoting with visual and tactile feedback. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 399–406, May 2016.