# Sample-Efficient Learning of Nonprehensile Manipulation Policies via Physics-Based Informed State Distributions

###### Abstract

This paper proposes a sample-efficient yet simple approach to learning closed-loop policies for nonprehensile manipulation. Although reinforcement learning (RL) can learn closed-loop policies without requiring access to underlying physics models, it suffers from poor sample complexity on challenging tasks. To overcome this problem, we leverage rearrangement planning to provide an informative physics-based prior on the environment’s optimal state-visitation distribution. Specifically, we present a new technique, Learning with Planned Episodic Resets (LeaPER), that resets the environment’s state to one informed by the prior during the learning phase. We experimentally show that LeaPER significantly outperforms traditional RL approaches by a factor of up to 5X on simulated rearrangement. Further, we relax dynamics from quasi-static to welded contacts to illustrate that LeaPER is robust to the use of simpler physics models. Finally, LeaPER’s closed-loop policies significantly improve task success rates relative to both open-loop execution of a planned path and simple feedback controllers that track open-loop trajectories. We demonstrate the performance and behavior of LeaPER on a physical 7-DOF manipulator at https://youtu.be/feS-zFq6J1c.

## Introduction

This paper addresses the problem of learning closed-loop policies for nonprehensile rearrangement. Consider the environment in Fig. 1a. The robot must push the target object from the front of the table to the desired goal at the back of the table. Rearrangement is frequently needed for robots to manipulate diverse objects in cluttered environments. To perform this task successfully, the robot must reach and carefully manipulate the box through a clutter of movable objects. Since this involves making and breaking contact with multiple objects over long horizons, rearrangement manipulation poses a challenging robotic task.

Previous work has considered rearrangement to be a planning problem, where a planner searches for feasible paths that connect the initial configuration of the environment to the desired goal configuration. The solution is then executed in an open-loop manner to perform the rearrangement task. These methods successfully generate expressive solutions [King, Cognetti, and Srinivasa 2016]. However, a key component of these planning algorithms is that they employ approximate but fast contact models to roll out simulations [Lynch and Mason 1996]. Hence, the open-loop paths that the planner generates are only realizable in the real world if the approximate simulations sufficiently mimic the real world. Due to the complexities of real contact physics, contact dynamics are often significantly relaxed for efficient planning. This causes the generated open-loop trajectories to diverge from the desired behavior and fail to reach the goal configuration (Fig. 1b).

An alternative approach is to use deep reinforcement learning (RL) to learn closed-loop policies. These policies take observations of the environment as inputs and output actions that move the target object to its desired goal. Since they observe and act at a high frequency, the policies can correct for deviations caused by uncertainty in dynamics and other modelling errors. However, RL has notoriously high sample complexity [Duan et al. 2016], which has limited its use to single object interactions [Agrawal et al. 2016] or to tasks with carefully designed dense reward functions [Brockman et al. 2016]. Interacting with multiple objects is fundamental to rearrangement, which also makes designing reward functions impractical.

The high sample complexity of RL can be attributed to the trade-off between exploration and exploitation [Kearns and Singh 2002], i.e., should an agent exploit its current experience to choose the seemingly best action or should it explore the environment by taking different actions that may lead to better future payouts? If the agent does not sufficiently explore, it may not converge to the optimal behavior. However, if the agent explores too much, it will require more episodes before converging on the optimal behavior. Considerable research has focused on exploration techniques for RL agents. One powerful technique is to force the policy to visit certain relevant states. CPI [Kakade and Langford 2002] sets the initial state distribution to a uniform distribution over the entire state space, which asymptotically ensures visiting every possible state. Although this approach provides strong guarantees of convergence, it is not feasible for the large state-space of rearrangement manipulation. A more informed distribution can be obtained from human demonstrations; Nair et al. (2017) and Hosu and Rebedea (2016) demonstrated significant improvements for manipulation tasks and Atari games. But how can we obtain these priors without expert demonstrations?

We contribute three key insights to more efficiently solving the rearrangement problem. Our first key insight is that the open-loop trajectories generated via rearrangement planning provide a powerful source of relevant states (Fig. 2). Although the quasi-static physics models used while planning are approximate, we show that the states visited by a solution are informative and significantly reduce our algorithm’s sample complexity. Experimentally, Learning with Planned Episodic Resets (LeaPER) accelerates learning by up to 5X on simulated rearrangement tasks.

Our second insight is that planning with even simpler physics models can generate solutions that are sufficiently informative to accelerate RL for rearrangement. To illustrate this, we further relax the dynamics from quasi-static contacts to welded contacts. Although this simplification increases sample complexity, it is significantly superior in performance to not using planned resets. More importantly, using simplified contact models makes the planning problem much easier to solve.

Our third insight is that LeaPER’s closed-loop policies are robust to stochastic dynamics since they can correct for deviations from the nominal path. Using LeaPER-learned closed-loop policies, we observe significant improvements in task success ratios compared to open-loop controls with a planned path or simple feedback controllers that track open-loop trajectories. We apply these robust policies to demonstrate LeaPER’s ability to solve the rearrangement manipulation task with a 7-DOF manipulator (Fig. 2).

## Background

Before describing our work, we briefly introduce prerequisite background information and a formalism for rearrangement planning and reinforcement learning. We refer the reader to Latombe (2012) and Kaelbling, Littman, and Moore (1996) for a more comprehensive introduction to these topics.

In rearrangement manipulation, a robot is tasked with moving movable objects to desired configurations in the workspace. However, there are obstacles that the robot is forbidden to penetrate.

### Solving the Rearrangement Planning Problem

Rearrangement manipulation can be considered a planning problem in a high-dimensional space that captures all possible configurations of the robot and other movable objects in the environment. The state space $\mathcal{X}$ in which we plan is therefore a Cartesian product space of the robot’s configuration space $\mathcal{C}^R$ and the configuration spaces $\mathcal{C}^1, \ldots, \mathcal{C}^m$ corresponding to each movable object, i.e., $\mathcal{X} = \mathcal{C}^R \times \mathcal{C}^1 \times \cdots \times \mathcal{C}^m$. Each state $x \in \mathcal{X}$ is defined as $x = (q, o^1, \ldots, o^m)$, where $q \in \mathcal{C}^R$ and $o^i \in \mathcal{C}^i$. The free state space $\mathcal{X}_{free} \subseteq \mathcal{X}$ is the set of all states where the robot and objects do not penetrate into each other or into obstacles. Contact between the physical entities is allowed, since it is necessary for rearrangement manipulation.

The rearrangement manipulation planning problem is to find a feasible trajectory $\xi$ through $\mathcal{X}_{free}$ starting from $x_0$ and ending in the goal region $\mathcal{X}_G$ at some time $T$. Since the evolution of the state is governed by the non-linear physics of contact and interaction between the robot and the objects, the states along the trajectory must satisfy the non-holonomic constraint $\dot{x} = f(x, u)$. Here, $u$ is the instantaneous control input, and $f$ encodes the physics of the environment. A path is feasible if there exists a control, $u$, at every time $t$ that satisfies the physics constraint $f$.

To find a feasible trajectory, we use a variant of Rapidly-Exploring Random Trees (RRT) [LaValle 1998] shown to be effective for planning in high-dimensional spaces with non-holonomic constraints. Specifically, we use the Physics-Constrained RRT (PC-RRT) planner [King et al. 2015], which embeds a physics model in the traditional RRT algorithm. We constrain plans and corresponding actions to a manifold that is parallel to the table’s surface. Moreover, to extend the search tree during planning, we use the quasi-static physics model [Lynch and Mason 1996] to propagate the control actions for a given control duration. (We refer the reader to the Appendix for more details.) This model assumes that the pushing motions are slow enough to render inertial forces negligible. Thus, we assume that objects move only when the robot contacts and pushes them. They immediately come to rest when the robot no longer applies a force on them. The constraint and the model therefore allow us to plan trajectories in a lower-dimensional space where each state is a set of SE(2) poses for the robot’s end-effector and each movable object [LaValle 2006].

### Reinforcement Learning for Rearrangement

Rearrangement manipulation can also be framed as a continuous space Markov Decision Process (MDP) represented by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, \rho_0, \rho_g)$, where $\mathcal{S}$ is a set of continuous states; $\mathcal{A}$ is a set of continuous actions; $P$ is the transition probability function; $r$ is the reward function; $\gamma$ is the discount factor; $\rho_0$ is the initial state distribution; and $\rho_g$ is the desired goal distribution. The MDP state $s$ is related to the planning configuration state $x$ as . In the remainder of this paper, we use state to refer to the MDP state $s$ and C-state for the planning configuration state $x$.

An episode for the agent begins by sampling $s_0$ from the initial state distribution $\rho_0$. At every timestep $t$, the agent takes an action $a_t = \pi(s_t)$ according to a policy $\pi$ to reach states in the goal distribution $\rho_g$. To achieve the goal, the agent receives a reward $r_t = r(s_t, a_t)$. The action results in a state transition to $s_{t+1}$, which is sampled according to probabilities $P(\cdot \mid s_t, a_t)$. The agent’s goal is to maximize the expected return $\mathbb{E}_{s_0}[R_0 \mid s_0]$, where the return is the discounted sum of the future rewards $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$. The $Q$-function is defined as $Q^{\pi}(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]$.
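As a small worked illustration of the return defined above (the discounted sum of future rewards), the following sketch computes it for every timestep of a finished episode. The function name and the toy reward sequence are hypothetical and chosen only for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted sum of future rewards for every timestep.

    rewards[t] is the reward received at timestep t; gamma is the discount factor.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy episode (hypothetical values): a single reward at the final step.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

The backward pass makes the effect of the discount factor visible: a reward that arrives late contributes less to the return of early timesteps.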

A key design decision is how to structure the reward function so that the robot achieves the desired goal configuration. Since rearrangement requires manipulating multiple objects, hand-designing a dense reward is challenging. Furthermore, recent work [Andrychowicz et al. 2017] shows that densifying rewards often leads to unintended behavior. Hence, we use a sparse reward.

### Deep Deterministic Policy Gradients (DDPG)

DDPG [Lillicrap et al. 2015] is an actor-critic RL algorithm that learns a deterministic continuous action policy. The algorithm maintains two neural networks: the policy (or actor) $\pi_\theta$ with neural network parameters $\theta$ and a $Q$-function approximator (or critic) $Q_\phi$ with neural network parameters $\phi$.

During training, episodes are generated using a noisy version of the policy (called the behavior policy), e.g., $\pi_b(s) = \pi_\theta(s) + \mathcal{N}(0, 1)$. A replay buffer stores the transition tuples $(s_t, a_t, r_t, s_{t+1})$ encountered during training [Mnih et al. 2015]. Training examples sampled from the replay buffer are used to optimize the critic. By minimizing the Bellman error loss $\mathcal{L}_c = (Q_\phi(s_t, a_t) - y_t)^2$, where $y_t = r_t + \gamma Q_\phi(s_{t+1}, \pi_\theta(s_{t+1}))$, the critic is optimized to approximate the $Q$-function. The actor is optimized by minimizing $\mathcal{L}_a = -\mathbb{E}_s[Q_\phi(s, \pi_\theta(s))]$. The gradient of $\mathcal{L}_a$ with respect to the actor parameters can be computed by backpropagating through the combined critic and actor networks.
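The critic update described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the actor and critic are passed in as plain Python callables, all function names are hypothetical, and the use of separate target networks for the bootstrap term is an assumption taken from the standard DDPG formulation.

```python
def critic_targets(batch, target_actor, target_critic, gamma):
    """Bellman targets: reward plus discounted critic value at the next state.

    batch holds (state, action, reward, next_state) tuples from the replay buffer.
    target_actor and target_critic are assumed target copies of the networks.
    """
    return [r + gamma * target_critic(s_next, target_actor(s_next))
            for (_s, _a, r, s_next) in batch]


def bellman_loss(batch, critic, targets):
    """Mean squared Bellman error of the critic on the sampled batch."""
    errors = [(critic(s, a) - y) ** 2
              for (s, a, _r, _s_next), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Toy one-dimensional stand-ins for the networks (hypothetical).
actor = lambda s: -s
critic = lambda s, a: s * a
batch = [(1.0, 0.5, -1.0, 2.0)]

targets = critic_targets(batch, actor, critic, gamma=0.5)
print(targets, bellman_loss(batch, critic, targets))  # [-3.0] 12.25
```

In a real implementation the loss would be minimized by gradient descent on the critic parameters, and the actor would be updated by ascending the critic's value of its own actions.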

### Hindsight Experience Replay (HER)

HER [Andrychowicz et al. 2017] is a simple way to manipulate the replay buffer used in off-policy RL algorithms to learn policies more efficiently using sparse rewards. Instead of learning a policy that takes only the state $s$ as input, the policy also takes the desired goal $g$ as input, i.e., $\pi(s, g)$. Without HER, after experiencing some episode $s_0, s_1, \ldots, s_T$, every transition $s_t \to s_{t+1}$ is stored with the episode’s goal in the replay buffer. With HER, the replay buffer stores the experienced transitions but with different goals, i.e., states reached later in the episode. Since the goal being pursued does not influence the environment dynamics, we can replay each trajectory using arbitrary goals assuming we use an off-policy RL algorithm for optimization.
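The relabeling step described above can be sketched as follows. This is an illustrative sketch only: states are scalars, the goal space is assumed to equal the state space, and the function names, the number of relabeled goals `k`, and the tolerance-based sparse reward are all hypothetical choices rather than details taken from this paper.

```python
import random


def sparse_reward(next_state, goal, tol=1e-9):
    """Hypothetical sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if abs(next_state - goal) <= tol else -1.0


def her_relabel(episode, reward_fn, k=2, rng=None):
    """Store each transition again with goals taken from later in the episode.

    episode is a list of (state, action, next_state) tuples.
    Returns (state, action, reward, next_state, goal) tuples.
    """
    rng = rng or random.Random(0)
    relabelled = []
    for t, (s, a, s_next) in enumerate(episode):
        # Candidate goals: states actually reached from step t onward.
        future = [step[2] for step in episode[t:]]
        for _ in range(k):
            g = rng.choice(future)
            relabelled.append((s, a, reward_fn(s_next, g), s_next, g))
    return relabelled


episode = [(0.0, 1.0, 1.0), (1.0, 1.0, 2.0), (2.0, 1.0, 3.0)]
for transition in her_relabel(episode, sparse_reward):
    print(transition)
```

Because every substituted goal was actually reached in the episode, some relabeled transitions are guaranteed to carry the success reward, which gives the off-policy learner a signal even when the original goal was never reached.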

## Episodic Resets from Planned States

### A motivating example

Our method is centered around altering the initial state distribution during learning. To analyze its effect, we use the CartPole environment (Fig. 3a) as a controlled example. Given an initial state of the environment , the agent must reach the goal . The RL formulation is similar to rearrangement planning, where rewards are sparse and the agent has a fixed horizon. We examine three sources of initial states: (a) the test-time initial state distribution , (b) the uniform state distribution [Kakade and Langford 2002] and (c) the optimal state visitation distribution . We compute the optimal policy using iLQR [Li and Todorov 2004] to generate .

We define as the ratio of sampling from and , with the probability of sampling from fixed at . Fig. 3 illustrates the effect of changing the mixing distribution on the KL divergence between the resulting policy’s state distribution and the optimal state distribution. Varying (Fig. 3b) shows how increasing samples from the uniform state distribution is better than sampling solely from . Varying (Fig. 3c) demonstrates how the learning rate can be further improved by sampling reset states from the optimal state visitation distribution . Finally, variations in (Fig. 3d) emphasize that even a small fraction of optimal states with uniformly sampled states can give a substantial speedup for learning. We note that although sampling from is useful, sampling states uniformly may not capture the relevant state space for complex tasks. Sampling from the optimal state visitation distribution alleviates this issue and ensures that the sampled states are valid and collision-free (e.g. for rearrangement). With these insights, we now present LeaPER.

### Learning with Planned Episodic Resets (LeaPER)

Our algorithm starts with the initial configuration of the environment and the desired goal space . In practice, is defined by the -ball around a single goal C-state . The first part of our method needs a planner to solve the rearrangement task, which gives us a trajectory of states . Here, and . Note that although this trajectory is feasible with quasi-static physics, it may not be feasible with real physics.

Given the planned trajectory , we can now start the learning process. For each episode in the environment, the initial state is uniformly sampled from with probability . With probability , is sampled from the original start distribution. Given this initial state, an episode is collected by rolling out the policy in the physics environment. The RL algorithm then optimizes the weights of and updates the policy to . After episodes, the closed-loop policy rearranges objects from to .
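The reset procedure described above can be sketched as a short training loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `p_planner` is a hypothetical name for the probability of resetting to a planner state, and `sample_start_state`, `rollout`, and `update` are hypothetical stand-ins for the original start distribution, the physics rollout, and the base RL update.

```python
import random


def sample_reset_state(planned_states, sample_start_state, p_planner, rng):
    """Pick the initial state for one episode.

    With probability p_planner, choose uniformly among the planner's states;
    otherwise draw from the original start distribution.
    """
    if rng.random() < p_planner:
        return rng.choice(planned_states)
    return sample_start_state(rng)


def train_with_planned_resets(planned_states, sample_start_state, rollout, update,
                              policy, n_episodes, p_planner, seed=0):
    """Alternate between collecting an episode from a reset state and updating."""
    rng = random.Random(seed)
    for _ in range(n_episodes):
        s0 = sample_reset_state(planned_states, sample_start_state, p_planner, rng)
        episode = rollout(policy, s0)     # roll out the policy in the physics environment
        policy = update(policy, episode)  # base RL algorithm step
    return policy


# Toy stand-ins (hypothetical): the "policy" simply records where episodes started.
planned = ["x0", "x1", "x2"]
starts = train_with_planned_resets(
    planned, lambda rng: "start", lambda policy, s0: [s0],
    lambda policy, episode: policy + [episode[0]],
    policy=[], n_episodes=5, p_planner=0.5)
print(starts)
```

Setting `p_planner` to zero recovers ordinary episodic training from the start distribution, so the only change to the base algorithm is where each episode begins.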

Since we have a sparse reward and a long horizon task, we use Hindsight Experience Replay (HER) with Deep Deterministic Policy Gradients (DDPG) as our base RL algorithm . For the planning algorithm , we use PC-RRT with quasi-static physics.

## Analyzing LeaPER

### Environments

To study the rearrangement manipulation problem, we consider three environments (Fig. 5). For all three, the BarrettHand [Hasan et al. 2013], which acts as the robot, has to manipulate the target object to the goal in the presence of two other movable objects. Similar to the task formulations in King, Cognetti, and Srinivasa (2016), the final positions of the other movable objects do not matter. In Env. 2 and Env. 3, an additional immovable obstacle makes the tasks harder. To simulate quasi-static physics, needed for the PC-RRT planner, we use Box2D [Catto 2011]. For full contact physics modelling, needed for learning, we use the MuJoCo simulator [Todorov, Erez, and Tassa 2012]. The planning environment has deterministic dynamics, while the physics parameters for the learning environment are sampled from a Gaussian distribution. Since we are performing tabletop rearrangement manipulation, the C-state space for each physical entity is the SE(2) state on the plane of the table. The action applied is the SE(2) velocity of the robot on the table’s plane. For more detail on these environments, we direct the reader to the Appendix.

### Speedup with Planned Episodic Resets

Our first claim is that resetting the episode from planner states improves sample efficiency, which solves the rearrangement task faster. To demonstrate this, we plot learning curves in Fig. 4 and compare the effect of planner resets with vanilla HER. Across all three environments, learning with planned episodic resets is significantly faster. Env. 2 shows the most gains, where planned episodic resets can learn policies 5X faster.

### Effect of Model Relaxation in Planning

Our second claim is that the model used for planning can be further relaxed while still speeding up learning with LeaPER. To demonstrate this, we use the Weld contact model, where the target is rigidly attached to the robot upon first contact. Details of the Weld model can be found in the Appendix. This simpler model reduces planning time and hence allows the planner to solve harder rearrangement tasks. Although Weld LeaPER is slower than quasi-static LeaPER, we show in Fig. 4 that it significantly outperforms solutions using no planner at all.

To further demonstrate this effect, Fig. 6 shows that different planning models affect the number of episodes LeaPER needs to solve rearrangement manipulation tasks. For all the environments we consider, Weld LeaPER not only outperforms vanilla HER but is also more robust to the initialization of the base RL algorithm.
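To make the Weld relaxation concrete, the sketch below steps a robot and a target under the rule stated above: once they first touch, the target moves rigidly with the robot. It is a deliberately simplified illustration and not the model from the Appendix: poses are reduced to planar positions without rotation, contact is detected with a hypothetical distance threshold `contact_dist`, and all names are made up for this example.

```python
import math


def weld_step(robot, target, velocity, dt, welded, contact_dist):
    """Advance one control step under a translation-only weld assumption.

    robot and target are (x, y) positions; velocity is the robot's planar velocity.
    Returns the new (robot, target, welded) triple.
    """
    dx, dy = velocity[0] * dt, velocity[1] * dt
    robot = (robot[0] + dx, robot[1] + dy)
    if welded:
        # After first contact the target shares the robot's displacement.
        target = (target[0] + dx, target[1] + dy)
    elif math.hypot(robot[0] - target[0], robot[1] - target[1]) <= contact_dist:
        # First contact: attach the target from the next step onward.
        welded = True
    return robot, target, welded


robot, target, welded = (0.0, 0.0), (1.0, 0.0), False
for _ in range(3):
    robot, target, welded = weld_step(robot, target, (1.0, 0.0), 0.5, welded, 0.25)
    print(robot, target, welded)
```

No contact forces or friction are computed at any point, which is why planning under such a model is cheap compared to quasi-static simulation.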

### Comparison to Planned Trajectories

Our final claim is that closed-loop policies learned by LeaPER are substantially more successful than either open-loop trajectories given by the planner or simple closed-loop controllers that track those trajectories. We compare the trajectories of the target object with different instantiations of transition dynamics in the MuJoCo simulator. Fig. 7 shows how LeaPER policies correct for deviations and reach the desired goal state. In contrast, running the open-loop trajectory with a velocity controller fails to generate optimal behaviour (Table 1).

| Environment | Open-Loop | iLQR | LeaPER (EPS# 5000) |
|---|---|---|---|
| Env. 1 | 0.00 | 0.78 | 1.00 |
| Env. 2 | 0.06 | 0.08 | 0.92 |
| Env. 3 | 0.02 | 0.14 | 0.99 |

One way to improve the simple velocity controller is to compute iLQR controllers [Li and Todorov 2004] around the desired trajectory. To do this, we use a quadratic cost function and initialize the controls with the open loop control values. We note that in Env. 1, iLQR controllers produce better results than the simple velocity controller, but they are not as good as LeaPER’s learned policies. For Env. 2 and Env. 3, obtaining controllers with longer time horizons is challenging and fails to significantly improve the velocity controller’s performance.

### Real Robot Experiments

To transfer the learned policies to a physical 7-DOF robot manipulator, we train the policy with randomized physics parameters similar to domain randomization in Peng et al. (2017). Since the inputs to the policy are the SE(2) states of the manipulator’s end-effector and the three movable objects, we use an OptiTrack motion capture system to track these positions. To make the policy robust to tracking errors, we also add uniform observation noise of meters and radians during training.

Since the policy outputs SE(2) end-effector velocities every 0.1 seconds, we apply this velocity for 0.1 seconds on the robot. This end-effector velocity is realized via vector field planning [Mussa-Ivaldi and Giszter 1992]. We note that due to communication delays in fetching tracked states from the OptiTrack, feeding these states through the policy to get the action, and finally applying this action, the execution of every step exceeds 0.1 seconds. However, since we train with transition dynamics and observation randomness, the policy is able to handle these delays. The policy is visualized in Fig. 2. Additional examples are available at https://youtu.be/feS-zFq6J1c.
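The uniform observation noise used during training can be sketched as below. The noise magnitudes are left as parameters because their values are not reproduced here; the numbers in the usage example are placeholders, and the function name and the choice of noise symmetric about zero are assumptions made for illustration.

```python
import random


def add_observation_noise(poses, pos_noise, ang_noise, rng):
    """Perturb each SE(2) pose (x, y, theta) with independent uniform noise.

    pos_noise bounds the position noise in meters; ang_noise bounds the
    orientation noise in radians. Both are assumed symmetric about zero.
    """
    noisy = []
    for x, y, theta in poses:
        noisy.append((
            x + rng.uniform(-pos_noise, pos_noise),
            y + rng.uniform(-pos_noise, pos_noise),
            theta + rng.uniform(-ang_noise, ang_noise),
        ))
    return noisy


# Placeholder magnitudes and poses, chosen only to make the example runnable.
rng = random.Random(0)
poses = [(0.0, 0.0, 0.0), (0.3, 0.1, 1.0)]
print(add_observation_noise(poses, pos_noise=0.01, ang_noise=0.05, rng=rng))
```

The perturbed poses would be fed to the policy in place of the simulator's exact poses, so the policy never learns to rely on perfectly accurate tracking.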

## Related Work

#### Rearrangement Planning.

Rearrangement planning can be considered a subset of manipulation planning [Latombe 2012], where both the motion of a robot and the objects must be considered while avoiding obstacles. Multiple works [Stilman and Kuffner 2007; Nieuwenhuisen, van der Stappen, and Overmars 2008] have explored planning for a robot to reach a goal configuration. For rearrangement, however, we also need objects to reach their desired configurations. Rearrangement planning [Wilfong 1991; Ota 2004; Krontiris and Bekris 2015] addresses the general form of this problem where every movable obstacle has a desired target configuration.

Early work in this field [Alami, Laumond, and Siméon 1994; Siméon et al. 2004] focused on pick-and-place operations, where objects could be moved by grasping them. A more relevant and easier problem to solve is when only a single object must reach its goal, while the final positions of other objects do not matter. Notable work [Stilman et al. 2007; Krontiris and Bekris 2016] solved this problem by efficiently clearing obstructions to the target object.

One of the challenges with planning to pick objects is its limitation to objects that are light and easily graspable. Nonprehensile manipulation like pushing [Lynch and Mason 1996] can be applied to a wider variety of objects, and its motions can be executed faster than grasping. However, the mechanics of pushing are complex and must be integrated into the planning process to solve the two-point boundary value problem (BVP). One way to solve this is to use a quasi-static contact model [Whitney 1982] and reduce the planning problem to computing a Dubins path [Zhou and Mason 2017]. This reduction, however, is valid only for single object manipulation with a single point of contact. To extend nonprehensile planning to multiple object interactions, physics simulators can be used to embed the physics model into the planner. King et al. (2015) and King, Cognetti, and Srinivasa (2016) demonstrate this by using a multi-body, quasi-static contact model in Box2D with a kinodynamic RRT planner to push multiple objects with a robot. The feasibility of these plans depends on how closely real-world physics matches quasi-static physics. In practice, due to errors in system identification and environment modelling as well as non-quasi-static interactions, open-loop controls can deviate from the planned trajectory, and objects can fail to reach their goal.

#### Learning for Rearrangement Manipulation.

Model-free reinforcement learning has achieved considerable success in playing Atari games [Mnih et al. 2015], simulated locomotion [Schulman et al. 2015; Lillicrap et al. 2015] and manipulation [Levine et al. 2016a]. However, transferring these algorithms to physical robots has been challenging due to the poor sample complexity of model-free learning and the reality gap between simulators and the real world. Randomizing physics and visual observations has shown promise in overcoming this gap [Sadeghi and Levine 2016; Peng et al. 2017; Pinto et al. 2018]. To alleviate poor sample complexity, several works have investigated large-scale data collection for simple tasks like grasping [Pinto and Gupta 2016; Levine et al. 2016b; Gupta et al. 2018] and pushing [Agrawal et al. 2016; Finn and Levine 2017; Pinto et al. 2016]. However, rearrangement manipulation is a long horizon and sparse reward task that involves manipulating multiple objects, rendering pure model-free learning undesirable.

One way to improve the sample complexity of model-free learning is to incorporate model-based information [Sutton 1991] during learning. Deisenroth and Rasmussen (2011) learn the dynamics model using Gaussian Processes and use rollouts with this model for learning closed-loop policies. Nagabandi et al. (2017) use learned dynamics models as a prior for model-free learning, while Bansal et al. (2017) use model-based priors for learning. Although these outperform pure model-free methods, they do not exploit the structure of these models. One way to do so is to plan to get an open-loop trajectory that is then executed with a learned low-level policy. However, this assumes that we have exact models for planning, and hence it has only been helpful for simple navigation tasks [Faust et al. 2018]. Instead of being constrained by the planner’s solutions or an approximate transition model, LeaPER uses the planner’s solution as a prior for learning.

## Discussion

In this work, we presented a sample-efficient yet simple method for learning nonprehensile rearrangement policies. We do this by resetting episodes with the solution from rearrangement planning during learning. The closed-loop policies learned from our method, LeaPER, can successfully perform several rearrangement tasks with stochastic dynamics. Furthermore, we can learn these policies even with extremely relaxed planning dynamics like the Weld contact model.

Although we use a powerful method, HER, as our base RL algorithm, we note that our method is agnostic to the type of base RL algorithm. However, it requires an episodic environment where resets are possible. Since we use the MuJoCo environment to learn, resetting the environment to the desired states is trivial. Transferring LeaPER to the real world, by contrast, will require engineering the environment to make it resettable. Nevertheless, policies learned in a powerful simulator can be transferred to real robots either by performing accurate state estimation or by employing state-of-the-art Sim2Real methods. We transfer the learned policy using OptiTrack to achieve accurate state estimation and dynamics domain randomization for the Sim2Real transfer.

LeaPER is not limited to the domain of rearrangement manipulation: planning has been used in several other domains where controllers are frequently designed with simplified models in mind. One example of this is walking, where the spring-loaded inverted pendulum [Schwind 1998] and the zero-moment point [Vukobratović and Borovac 2004] models are used to generate fast plans. We believe that LeaPER can generate robust closed-loop solutions in these domains as well.

## Acknowledgments

We thank Gilwoo Lee for assistance with the physical robot experiments. We also thank Sandy Kaplan, Xingyu Lin and Christopher Atkeson for feedback and suggestions. This work was supported by a NASA Space Technology Research Fellowship (#80NSSC17K0137), the National Institutes of Health R01 (#R01EB019335), National Science Foundation CPS (#1544797), National Science Foundation NRI (#1637748), the Office of Naval Research, the RCTA, Amazon, and Honda.

## References

- [\citeauthoryearAgrawal et al.2016] Agrawal, P.; Nair, A. V.; Abbeel, P.; Malik, J.; and Levine, S. 2016. Learning to poke by poking: Experiential learning of intuitive physics. In NIPS.
- [\citeauthoryearAlami, Laumond, and Siméon1994] Alami, R.; Laumond, J.-P.; and Siméon, T. 1994. Two manipulation planning algorithms. In WAFR.
- [\citeauthoryearAndrychowicz et al.2017] Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; and Zaremba, W. 2017. Hindsight experience replay. NIPS.
- [\citeauthoryearBansal et al.2017] Bansal, S.; Calandra, R.; Levine, S.; and Tomlin, C. 2017. MBMF: Model-Based Priors for Model-Free Reinforcement Learning. arXiv preprint arXiv:1709.03153.
- [\citeauthoryearBrockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym. arXiv preprint arXiv:1606.01540.
- [\citeauthoryearCatto2011] Catto, E. 2011. Box2d: A 2d physics engine for games.
- [\citeauthoryearDeisenroth and Rasmussen2011] Deisenroth, M., and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In ICML.
- [\citeauthoryearDuan et al.2016] Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In ICML.
- [\citeauthoryearFaust et al.2018] Faust, A.; Ramirez, O.; Fiser, M.; Oslund, K.; Francis, A.; Davidson, J.; and Tapia, L. 2018. PRM-RL: Long-range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning. ICRA.
- [\citeauthoryearFinn and Levine2017] Finn, C., and Levine, S. 2017. Deep visual foresight for planning robot motion. In ICRA.
- [\citeauthoryearGoyal, Ruina, and Papadopoulos1991] Goyal, S.; Ruina, A.; and Papadopoulos, J. 1991. Planar sliding with dry friction part 1. limit surface and moment function. Wear (Amsterdam, Netherlands).
- [\citeauthoryearGupta et al.2018] Gupta, A.; Murali, A.; Gandhi, D.; and Pinto, L. 2018. Robot learning in homes: Improving generalization and reducing dataset bias. arXiv preprint arXiv:1807.07049.
- [\citeauthoryearHasan et al.2013] Hasan, M. R.; Vepa, R.; Shaheed, H.; and Huijberts, H. 2013. Modelling and control of the barrett hand for grasping. In Computer Modelling and Simulation (UKSim).
- [\citeauthoryearHesse et al.2017] Hesse, C.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; and Wu, Y. 2017. Openai baselines.
- [\citeauthoryearHosu and Rebedea2016] Hosu, I.-A., and Rebedea, T. 2016. Playing atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077.
- [\citeauthoryearKaelbling, Littman, and Moore1996] Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement learning: A survey. Journal of artificial intelligence research.
- [\citeauthoryearKakade and Langford2002] Kakade, S., and Langford, J. 2002. Approximately optimal approximate reinforcement learning. In ICML.
- [\citeauthoryearKearns and Singh2002] Kearns, M., and Singh, S. 2002. Near-optimal reinforcement learning in polynomial time. Machine learning.
- [\citeauthoryearKing et al.2013] King, J. E.; Klingensmith, M.; Dellin, C. M.; Dogar, M. R.; Velagapudi, P.; Pollard, N. S.; and Srinivasa, S. S. 2013. Pregrasp manipulation as trajectory optimization. In RSS.
- [\citeauthoryearKing et al.2015] King, J. E.; Haustein, J. A.; Srinivasa, S. S.; and Asfour, T. 2015. Nonprehensile whole arm rearrangement planning on physics manifolds. In ICRA.
- [\citeauthoryearKing, Cognetti, and Srinivasa2016] King, J. E.; Cognetti, M.; and Srinivasa, S. S. 2016. Rearrangement planning using object-centric and robot-centric action spaces. In ICRA.
- [\citeauthoryearKingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [\citeauthoryearKrontiris and Bekris2015] Krontiris, A., and Bekris, K. E. 2015. Dealing with difficult instances of object rearrangement. In RSS.
- [\citeauthoryearKrontiris and Bekris2016] Krontiris, A., and Bekris, K. E. 2016. Efficiently solving general rearrangement tasks: A fast extension primitive for an incremental sampling-based planner. In ICRA.
- [\citeauthoryearLatombe2012] Latombe, J.-C. 2012. Robot motion planning, volume 124. Springer Science & Business Media.
- [\citeauthoryearLaValle1998] LaValle, S. M. 1998. Rapidly-exploring random trees: A new tool for path planning.
- [\citeauthoryearLaValle2006] LaValle, S. M. 2006. Planning algorithms. Cambridge University Press.
- [\citeauthoryearLevine et al.2016a] Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2016a. End-to-end training of deep visuomotor policies. JMLR.
- [\citeauthoryearLevine et al.2016b] Levine, S.; Pastor, P.; Krizhevsky, A.; and Quillen, D. 2016b. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. ISER.
- [\citeauthoryearLi and Todorov2004] Li, W., and Todorov, E. 2004. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1).
- [\citeauthoryearLillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- [\citeauthoryearLynch and Mason1996] Lynch, K. M., and Mason, M. T. 1996. Stable pushing: Mechanics, controllability, and planning. IJRR.
- [\citeauthoryearMason1986] Mason, M. T. 1986. Mechanics and planning of manipulator pushing operations. IJRR.
- [\citeauthoryearMnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature.
- [\citeauthoryearMussa-Ivaldi and Giszter1992] Mussa-Ivaldi, F. A., and Giszter, S. F. 1992. Vector field approximation: a computational paradigm for motor control and learning. Biological Cybernetics.
- [\citeauthoryearNagabandi et al.2017] Nagabandi, A.; Kahn, G.; Fearing, R. S.; and Levine, S. 2017. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596.
- [\citeauthoryearNair et al.2017] Nair, A.; McGrew, B.; Andrychowicz, M.; Zaremba, W.; and Abbeel, P. 2017. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089.
- [\citeauthoryearNieuwenhuisen, van der Stappen, and Overmars2008] Nieuwenhuisen, D.; van der Stappen, A. F.; and Overmars, M. H. 2008. An effective framework for path planning amidst movable obstacles.
- [\citeauthoryearOta2004] Ota, J. 2004. Rearrangement of multiple movable objects-integration of global and local planning methodology. In ICRA.
- [\citeauthoryearPeng et al.2017] Peng, X. B.; Andrychowicz, M.; Zaremba, W.; and Abbeel, P. 2017. Sim-to-real transfer of robotic control with dynamics randomization. arXiv preprint arXiv:1710.06537.
- [\citeauthoryearPinto and Gupta2016] Pinto, L., and Gupta, A. 2016. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. ICRA.
- [\citeauthoryearPinto et al.2016] Pinto, L.; Gandhi, D.; Han, Y.; Park, Y.-L.; and Gupta, A. 2016. The curious robot: Learning visual representations via physical interactions. In ECCV. Springer.
- [\citeauthoryearPinto et al.2018] Pinto, L.; Andrychowicz, M.; Welinder, P.; Zaremba, W.; and Abbeel, P. 2018. Asymmetric actor critic for image-based robot learning. RSS.
- [\citeauthoryearSadeghi and Levine2016] Sadeghi, F., and Levine, S. 2016. (CAD)2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
- [\citeauthoryearSchulman et al.2015] Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In ICML.
- [\citeauthoryearSchwind1998] Schwind, W. J. 1998. Spring loaded inverted pendulum running: A plant model. Ph.D. Dissertation.
- [\citeauthoryearSiméon et al.2004] Siméon, T.; Laumond, J.-P.; Cortés, J.; and Sahbani, A. 2004. Manipulation planning with probabilistic roadmaps. IJRR.
- [\citeauthoryearStilman and Kuffner2007] Stilman, M., and Kuffner, J. J. 2007. Navigation among movable obstacles. Ph.D. Dissertation, Citeseer.
- [\citeauthoryearStilman et al.2007] Stilman, M.; Schamburek, J.-U.; Kuffner, J.; and Asfour, T. 2007. Manipulation planning among movable obstacles. Georgia Institute of Technology.
- [\citeauthoryearSucan, Moll, and Kavraki2012] Sucan, I. A.; Moll, M.; and Kavraki, L. E. 2012. The open motion planning library. IEEE Robotics Automation Magazine 19(4):72–82.
- [\citeauthoryearSutton1991] Sutton, R. S. 1991. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin.
- [\citeauthoryearTodorov, Erez, and Tassa2012] Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In IROS.
- [\citeauthoryearVukobratović and Borovac2004] Vukobratović, M., and Borovac, B. 2004. Zero-moment point - thirty five years of its life. International Journal of Humanoid Robotics.
- [\citeauthoryearWhitney1982] Whitney, D. E. 1982. Quasi-static assembly of compliantly supported rigid parts. Journal of Dynamic Systems, Measurement, and Control.
- [\citeauthoryearWilfong1991] Wilfong, G. 1991. Motion planning in the presence of movable obstacles. Annals of Mathematics and Artificial Intelligence.
- [\citeauthoryearZhou and Mason2017] Zhou, J., and Mason, M. T. 2017. Pushing revisited: Differential flatness, trajectory planning and stabilization. In ISRR.

## Appendix A

### Environment Details

All the movable objects, including the target object, are cubes with a side length of 8 cm. The obstacle in Env. 2 is 30 cm long and 8 cm wide, while the obstacle in Env. 3 is 70 cm long and 8 cm wide. Each timestep in the MuJoCo simulation is 0.1 seconds, so the policy acts at 10 Hz. The episode length is 50 timesteps for Env. 1 and Env. 2, and 150 timesteps for Env. 3. Contacts are enabled for all the physical entities in the workspace (table, target object, movable objects, obstacles, and robot). We use regular frictional contacts, which can generate both a normal force and a tangential friction force opposing slip (equivalent to using in MuJoCo).

The action is the robot’s SE(2) velocity, which is 3-dimensional. For stable simulations, the action velocity of the robot is limited to (0.25 m/s, 0.25 m/s, 2.5 rad/s). The observation from the environment is the position and angle of every movable entity in a global frame, so the observation space is 16-dimensional. The goal region is a disc of radius 5 cm. If the target object does not come within 5 cm of the desired goal, it gets a reward of .
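The velocity limits and the goal test described above can be sketched as follows. This is a minimal illustration, not the authors' environment code: the function names are hypothetical, the limits are assumed to be symmetric about zero, and the goal test is assumed to use the target object's centre. The reward value itself is left out because it is not recoverable here.

```python
import math

# Per-axis velocity limits from the text: (0.25 m/s, 0.25 m/s, 2.5 rad/s).
# Assumption: the limits apply symmetrically, i.e. to the magnitude of each component.
ACTION_LIMITS = (0.25, 0.25, 2.5)
GOAL_RADIUS = 0.05  # goal disc radius of 5 cm, expressed in metres


def clip_action(action):
    """Clamp an SE(2) velocity (vx, vy, omega) to the per-axis limits."""
    return tuple(max(-lim, min(lim, a)) for a, lim in zip(action, ACTION_LIMITS))


def in_goal_region(target_xy, goal_xy):
    """Return True if the target object's centre lies inside the goal disc."""
    return math.dist(target_xy, goal_xy) <= GOAL_RADIUS
```

For example, `clip_action((0.5, -0.1, -3.0))` saturates the first and third components while leaving the second unchanged.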

Before every episode, a random physics parameter is sampled from a normal parameter distribution and held constant throughout the episode. These parameters include the mass of movable entities and the friction of all physical entities. The mean is the nominal parameter value, and the standard deviation is twice the magnitude of the nominal parameter. If a sampled parameter is invalid, it is re-sampled. For training policies to run on the robot, we add an observation noise of 1 cm to the positions and 0.1 rad to the angles of every movable entity.
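The per-episode randomization above amounts to rejection sampling from a normal distribution. The sketch below illustrates the idea under stated assumptions: a sample is treated as "invalid" when it is not strictly positive (the text does not define validity), and the parameter names and nominal values in the usage note are hypothetical.

```python
import random


def sample_physics_parameter(nominal, rng):
    """Draw from N(nominal, (2*|nominal|)^2), re-sampling until the value is valid.

    Assumption: "valid" means strictly positive, which suits masses and
    friction coefficients; the source does not spell out its validity test.
    """
    if nominal <= 0.0:
        raise ValueError("nominal value must be positive under this validity rule")
    std = 2.0 * abs(nominal)  # standard deviation is twice the nominal magnitude
    while True:
        value = rng.gauss(nominal, std)
        if value > 0.0:
            return value


def sample_episode_parameters(nominals, rng):
    """Sample every parameter once; the caller holds them fixed for the episode."""
    return {name: sample_physics_parameter(v, rng) for name, v in nominals.items()}
```

A call such as `sample_episode_parameters({"mass": 0.5, "friction": 1.0}, random.Random(0))` would be made once before each episode, with the returned values written into the simulator before the first step.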

### Training Details

We build on the OpenAI Baselines framework [\citeauthoryearHesse et al.2017] to train our policies with HER. Our actor and critic networks both contain 3 hidden layers with 256 neurons each. The Adam [\citeauthoryearKingma and Ba2014] optimizer is used with an initial learning rate of 0.01. The number of goals set for replay in HER is 4. While applying actions, a random action is chosen 30% of the time, and Gaussian noise with a standard deviation of 0.2 is added to the predicted action for the remaining 70%. These parameters are similar to the HER defaults, and we did not need to tune them.
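The exploration scheme described above can be sketched as below. This is an illustrative reimplementation rather than the Baselines code: the function name is hypothetical, actions are assumed to be normalized to [-1, 1] per dimension, random actions are assumed to be uniform over that range, and noisy actions are clipped back into it.

```python
import random


def explore(predicted_action, rng, random_eps=0.3, noise_std=0.2, max_u=1.0):
    """Return the action applied during training rollouts.

    With probability random_eps a uniformly random action is used; otherwise
    zero-mean Gaussian noise is added to the policy's predicted action.
    Assumption: actions are normalized to [-max_u, max_u] in every dimension.
    """
    if rng.random() < random_eps:
        return [rng.uniform(-max_u, max_u) for _ in predicted_action]
    noisy = [a + rng.gauss(0.0, noise_std) for a in predicted_action]
    return [max(-max_u, min(max_u, a)) for a in noisy]
```

Setting `random_eps=0.0` and `noise_std=0.0` recovers the deterministic policy output, which is the usual choice at evaluation time.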

### The Quasi-Static Pushing Model

Consider the robot hand and object in Fig. 8a, where the friction cones [\citeauthoryearMason1986] can be computed from the contact friction coefficient. The limit surface [\citeauthoryearGoyal, Ruina, and Papadopoulos1991] relates the generalized forces applied on the object to the resulting generalized velocity. With finite pressure, the limit surface is similar to the three-dimensional ellipsoid in Fig. 8b. Here, the dimensions are the force in the x-direction, the force in the y-direction, and the moment.

In quasi-static pushing, the generalized forces are constrained to the limit surface, and the resulting generalized velocity can be computed using the normal to the surface. Fig. 8 shows the friction cone and three-dimensional cone corresponding to the applied force. With quasi-static physics, given an applied force and friction coefficient, the object is constrained to move with velocities prescribed by the segment on the limit surface.
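As a rough illustration of how the surface normal gives the velocity, the sketch below uses the common ellipsoidal approximation of the limit surface, (f_x / f_max)^2 + (f_y / f_max)^2 + (m / m_max)^2 = 1. The symbols f_max and m_max (maximum friction force and moment) and the function names are assumptions introduced here; this is not the planner's implementation.

```python
def on_limit_surface(fx, fy, m, f_max, m_max, tol=1e-9):
    """Check whether the wrench (fx, fy, m) lies on the assumed ellipsoid."""
    level = (fx / f_max) ** 2 + (fy / f_max) ** 2 + (m / m_max) ** 2
    return abs(level - 1.0) <= tol


def velocity_direction(fx, fy, m, f_max, m_max):
    """Generalized velocity direction (vx, vy, omega), up to a positive scale.

    The velocity is parallel to the outward normal of the limit surface at the
    applied wrench, i.e. to the gradient of the ellipsoid's level function.
    """
    if fx == 0.0 and fy == 0.0 and m == 0.0:
        raise ValueError("zero wrench has no defined velocity direction")
    return (fx / f_max ** 2, fy / f_max ** 2, m / m_max ** 2)
```

Under this approximation, a pure force through the centre of friction produces pure translation along the force, while any applied moment adds a rotational component whose size depends on the ratio of f_max to m_max.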

### Planning Details

We use the planning framework detailed in [\citeauthoryearKing et al.2015], which extends the Open Motion Planning Library (OMPL) [\citeauthoryearSucan, Moll, and Kavraki2012]. We use Box2D [\citeauthoryearCatto2011] to simulate the physics interactions in 2D, since we constrain the end-effector to, and plan for it on, a plane parallel to the table. We enforce the quasi-static (Fig. 8b) and relaxed Weld (Fig. 8c) physics models in the simulator. As expected, the planning times are significantly lower with the relaxed physics model. For Env. 3, the average planning time with the quasi-static model is 35.2 seconds across 10 trials. With the Weld contact model, we obtain a roughly 2X speed-up, with an average planning time of 16.1 seconds.