“Good Robot!”: Efficient Reinforcement Learning for Multi-Step Visual Tasks with Sim to Real Transfer


In order to effectively learn multi-step tasks, robots must be able to understand the context by which task progress is defined. In reinforcement learning, much of this information is provided to the learner by the reward function. However, comparatively little work has examined how the reward function captures—or fails to capture—task context in robotics, particularly in long-horizon tasks where failure is highly consequential.

To address this issue, we describe the Schedule for Positive Task (SPOT) Reward and the SPOT-Q reinforcement learning algorithm, which efficiently learn multi-step block manipulation tasks in both simulation and real-world environments. SPOT-Q is remarkably effective compared to past benchmarks. It successfully completes simulated trials of a variety of tasks including stacking cubes (98%), clearing toys by pushing and grasping arranged in random (100%) and adversarial (95%) patterns, and creating rows of cubes (93%). Furthermore, we demonstrate direct sim to real transfer. By directly loading the simulation-trained model on the real robot, we are able to create real stacks in 90% of trials and rows in 80% of trials with no additional real-world fine-tuning. Our system is also quite efficient – models train within 1-10k actions, depending on the task. As a result, our algorithm makes learning complex, multi-step tasks both efficient and practical for real world manipulation tasks. Code is available at https://github.com/jhu-lcsr/good_robot.

I Introduction

Multi-step tasks in real-world settings are challenging to learn. Such tasks intertwine learning the immediate physical consequences of actions with the need to understand how these consequences affect progress toward the overall task goal. Furthermore, in contrast to traditional motion planning which assumes perfect information and known action models, learning only has access to the spatially and temporally limited information that can be gained from sensing the environment.

We describe the Schedule for Positive Task Q (SPOT-Q) learning algorithm, an approach for efficiently making use of prior spatial and temporal information within an image-based deep reinforcement learning (DRL) algorithm. Our key observation is that significant time is wasted during reinforcement learning while exploring actions which are not productive. For example, in a block stacking task like that shown in Fig. 1, the knowledge that grasping at empty air will never snag an object is “common sense” for humans. However, a vanilla learning algorithm has to discover that attempting a grasp in open space is unproductive. By contrast, SPOT-Q is able to incorporate such common sense constraints which, as we show in both simulation and real-world testing, significantly accelerates both the learning problem and final task efficiency. Our models are able to focus on learning core concepts. For example, grasping an object from a stack does not contribute to making the stack higher; the robot should focus on lone, unstacked objects.

Fig. 1: Robot-created stacks and rows of cubes. Our Schedule for Positive Task (SPOT) Reward and SPOT-Q algorithm allow us to efficiently find policies which can complete multi-step tasks. Video overview: https://youtu.be/PzU-UfzLXUM
Fig. 2: Our model architecture. Images are pre-rotated to 16 orientations before being passed to the network. Every coordinate in the output pixel-wise SPOT-Q-Values corresponds to a final gripper position, orientation, and open loop action. Purple circles highlight the highest likelihood action (Eq. 9) with an arrow to the corresponding height map coordinate, and point out how these values are transformed to a gripper pose. The rotated overhead views overlay the Q value at each pixel from dark blue values near 0 to red for high probabilities. Green arrows identify the same object across two oriented views; take a moment to compare the score of a single object across all actions. Each object is scored in a way which leads to a successful stack in accordance with its surrounding context. The grasp model learns to give a high score to the lone unstacked red block for grasp actions and a low score to the yellow top of the stack, while the place model does the reverse. Here the model chooses to grasp the red block and place on the yellow, blue, and green stack.

While these types of constraints are intuitive, incorporating them into DRL in a manner that leads to reliable and efficient learning is nontrivial [14, 33]. Our SPOT reward schedule (Sec. III-B) takes inspiration from a humane and effective approach to training pets sometimes called “Positive Conditioning.” Consider the goal of training a dog “Spot” to ignore an object or event she finds particularly interesting on command. In this practice, Spot is rewarded with treats whenever partial compliance with the desired end behavior is shown, and simply removed from harmful or regressive situations with zero treats (reward). One way to achieve this is to start with multiple treats in hand, place one treat in view of Spot, and if she eagerly jumps at the treat (a negative action) the human snatches and hides the treat immediately for zero reward on that action. With repetition, Spot will eventually hesitate, and so she is immediately praised with “Good Spot!” and gets a treat separate from the one she should ignore. This approach can be expanded to new situations and behaviors, and it encourages exploration and rapid improvement once an initial partial success is achieved. As we describe in Sec. III, our SPOT reward, SPOT Trial Reward, and SPOT-Q Learning are likewise designed to provide neither reward nor punishment for actions which reverse progress.

We do better than state of the art in simulation, we can train efficiently on real-world situations, and we show that we are able to perform direct sim to real on two multi-step tasks. In summary, our contributions in this article are: (1) Reinforcement Learning of multi-step robotic tasks combining model-based low level control, model-free learning of high level goals—including the progressive SPOT reward—and SPOT Trial reward schedules; (2) The SPOT-Q reinforcement learning algorithm with dynamic action space situation removal; and (3) experiments showing state of the art performance and domain transfer for models trained using SPOT-Q.

II Related Work

Deep Neural Networks (DNNs) have enabled the use of raw sensor data in robotic manipulation [18, 19, 13, 33, 14]. In some approaches, a DNN’s output directly corresponds to motor commands, e.g. [18, 19]. Higher-level methods, on the other hand, assume a simple model for robotic control and focus on bounding box or pose detection for downstream grasp planning [26, 35, 7, 15, 12, 16, 22, 33]. Increasingly, these methods benefit from the depth information provided by RGB-D sensors [33, 22, 24], which capture physical information about the workspace. However, the agent must still develop physical intuition, which recent work attempts in a more targeted setting. [17, 9] focus on block stacking by classifying simulated stacks as stable or likely to fall. Of these, the ShapeStacks dataset [9] includes a larger variety of objects such as cylinders and spheres, as opposed to blocks alone. Similarly, [8, 4] develop physical intuition by predicting push action outcomes. Our work diverges from these approaches by developing visual understanding and physical intuition simultaneously, in concert with understanding progress in multi-step tasks. Object-centric skill learning can more quickly learn skills that will better generalize, e.g. [10, 5].

Grasping is a particularly active area of research. DexNet [20, 21] learns from a large number of depth images of top-down grasps, and gets extremely good performance on grasping novel objects, but does not look at long-horizon tasks. 6-dof GraspNet [23] uses large amounts of simulated grasp data to generalize to new objects and has been extended to handle reliable grasping of novel objects in clutter [24].

Deep reinforcement learning has proven effective at increasingly complex tasks in robotic manipulation [34, 33, 32, 13]. QT-Opt [13] learns manipulation skills from hundreds of thousands of real-world grasp attempts on real robots. Other methods focus on transferring visuomotor skills from simulated to real robots [34, 36]. Our work directs a low-level controller to perform actions rather than regressing torque vectors directly, following prior work [33, 32] by learning a pixel-wise success likelihood map.

Domain Adaptation refers to methods for the purpose of generalizing an algorithm to a new domain. In robotics, it is common to modify training images in simulation by applying random textures for transfer to the real world [28, 3].

We compare our approach to VPG [33], a recently developed method for learning RL-based table clearing tasks from images which can be trained within hours on a single robot. It is frequently able to complete adversarial scenarios like those in Fig. 7, by first pushing a tightly packed group and then grasping the now-separated objects. VPG assumes an instantaneous reward delivery is sufficient to complete the task at hand. Following VPG, Form2Fit [31] tackles kitting tasks with well defined containers within which objects should be placed, but is not investigating reinforcement learning. By contrast, we tackle multi-step tasks with sparse rewards, no visible pattern representing a goal, and stacks which are simultaneously a goal and an obstacle. These tasks cannot be represented by either VPG or Form2Fit.

Multi-step tasks with sparse rewards present a particular challenge in reinforcement learning generally because solutions are less likely to be discovered through random exploration. When available, demonstration can be an effective method for guiding exploration [30, 2, 11]. Within robotic manipulation specifically, another approach separates multi-step tasks into modular sub-tasks comprising a sketch [1], while yet another separates the learning architecture into robot- and task-specific modules [6]. Our SPOT Reward, meanwhile, combines a novel reward schedule with a dynamic action space to make early successes both more likely as well as more instructive.

III Approach

Fig. 3: Temporal and workspace dependencies when stacking four blocks. Events at the current time step can influence the likelihood of successful outcomes for both past and future actions. A successful choice of action at any given time step will ensure both past and future actions are productive contributors to the larger task at hand, while failures indicate either a lack of progress or regression to an earlier stage. In our experiments a partial stack or row is itself a scene obstacle.

We investigate the general problem of assembly through vision-based robotic manipulation. As in past work [33], we make the simplifying assumption that the sensor observations available at any time embed all necessary state information and thus provide sufficient information of the environment to choose correct actions—effectively equating sensor observations and state. We can then frame the problem as a Markov Decision Process $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability function $\mathcal{T}(s_{t+1} \mid s_t, a_t)$, and reward function $\mathcal{R}(s_t, s_{t+1}, a_t)$. At time step $t$, the agent observes state $s_t \in \mathcal{S}$ and chooses an action $a_t \in \mathcal{A}$ according to its policy $\pi(s_t)$. The action results in a new state $s_{t+1}$ with probability $\mathcal{T}(s_{t+1} \mid s_t, a_t)$. As in VPG [33], we use Q-learning to produce a deterministic policy for choosing actions. The function $Q(s_t, a_t)$ estimates the expected reward of an action from a given state, i.e. the “quality” of an action, which determines the policy:

$\pi(s_t) = \operatorname{argmax}_{a_t \in \mathcal{A}} Q(s_t, a_t)$   (1)
Thus, the goal of training is to learn a $Q$ that maximizes reward over time. This is accomplished by iteratively minimizing $|Q(s_t, a_t) - y_t|$, where the target value $y_t$ is:

$y_t = \mathcal{R}(s_t, s_{t+1}, a_t) + \gamma\, Q\big(s_{t+1}, \operatorname{argmax}_{a \in \mathcal{A}} Q(s_{t+1}, a)\big)$   (2)
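As a concrete illustration, the target computation in eq. 2 can be sketched with a toy tabular Q-function; the states, values, and discount below are illustrative stand-ins, since the paper's actual Q is a pixel-wise network:

```python
import numpy as np

def q_target(q_table, reward, next_state, gamma=0.5):
    """Q-learning target: y = r + gamma * Q(s', argmax_a Q(s', a)).

    q_table: array of shape (num_states, num_actions), a tabular
    stand-in for the learned Q-function. gamma is an assumed discount.
    """
    best_next_action = np.argmax(q_table[next_state])
    return reward + gamma * q_table[next_state, best_next_action]

# Toy example with 2 states and 3 actions.
q = np.array([[0.2, 0.5, 0.1],
              [0.4, 0.9, 0.3]])
y = q_target(q, reward=1.0, next_state=1, gamma=0.5)
# y = 1.0 + 0.5 * 0.9 = 1.45
```

Training then regresses $Q(s_t, a_t)$ toward this target after each executed action.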
In our experiments, we consider a robotic agent capable of being commanded to a specified arm pose and gripper state in its workspace. The agent observes the environment via a fixed RGB-D camera, which we project so that the resulting heightmap is aligned with the direction of gravity, as shown in Fig. 2. As per VPG [33], our heightmap is a square with 0.448 m on each side. The projected heightmap color and depth images each have a resolution of 224×224, meaning that each pixel represents roughly 2×2 mm of the workspace.

Our action space consists of three components: action types $\Phi$, locations $\mathcal{X}$, and angles $\Theta$. First, the set of action types consists of three high-level motion primitives $\phi \in \{\text{push}, \text{grasp}, \text{place}\}$. Second, we discretize the spatial action space into 224×224 bins with coordinates $(x, y)$ matching the heightmap pixels. The angle space is similarly discretized into 16 bins.

A traditional trajectory planner executes each action on the robot. For grasping and placing, each action moves to a location $(x, y, z)$ with gripper angle $\theta$ and closes or opens the gripper, respectively. A push starts with the gripper closed at $(x, y, z)$ and moves horizontally a fixed distance along angle $\theta$. Fig. 2 visualizes our overall algorithm, including the action space and corresponding Q-values.
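The mapping from a pixel-wise argmax to a gripper pose can be sketched as follows. The function name and dictionary layout are hypothetical; the 16-angle, 224-bin, 0.448 m discretization follows the text above:

```python
import numpy as np

NUM_ROTATIONS = 16        # discretized gripper angles
GRID = 224                # spatial bins per side
WORKSPACE_METERS = 0.448  # square workspace side length

def decode_action(q_maps):
    """Map the argmax of pixel-wise Q-values to a gripper pose.

    q_maps: dict of action type -> array (NUM_ROTATIONS, GRID, GRID).
    Returns (action_type, angle_radians, x_meters, y_meters).
    """
    best = None
    for phi, maps in q_maps.items():
        idx = np.unravel_index(np.argmax(maps), maps.shape)
        value = maps[idx]
        if best is None or value > best[0]:
            best = (value, phi, idx)
    _, phi, (rot, row, col) = best
    theta = rot * (2 * np.pi / NUM_ROTATIONS)
    pixel_size = WORKSPACE_METERS / GRID  # meters per pixel
    return phi, theta, col * pixel_size, row * pixel_size
```

For instance, a maximum in the grasp map at rotation bin 4, pixel (100, 50) decodes to a grasp at angle π/2 and workspace coordinates (0.1 m, 0.2 m).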

III-A Reward Shaping

We focus on learning neural network models which can solve multi-stage tasks with strong contextual and temporal dependencies between stages. Reward shaping is an effective technique for training these neural networks, which in our case involves rewarding intermediate sub-tasks as well as the overall task. Here, we present a general formulation for reward shaping which is conducive to novel tasks and reduces the ad hoc nature of successful reward schedules, based on action types.

Suppose we have a sub-task indicator function $\mathbb{1}_a[s_t, s_{t+1}, a_t] \in \{0, 1\}$ that equals 1 if $a_t$ succeeds at a sub-task and 0 otherwise. Examples of indicator sources can include human supervision, another algorithm, or the grasp detector built into our Robotiq 2F-85 gripper. In our experiments, a push action is successful if it perturbs an object, whereas a place action is successful only if it increases the stack height. Our reward schedule is a function $W(\phi)$ which determines the appropriate reward for each successful action type:

$\mathcal{R}_a(s_t, s_{t+1}, a_t) = W(\phi_t)\, \mathbb{1}_a[s_t, s_{t+1}, a_t]$   (3)

where $W : \Phi \to \mathbb{R}_{>0}$.

This is a desirable formulation because it allows the reward function to apply weights to different action types. For example, utilizing place actions is more complex than grasp actions, because a successful placement depends on the prior actions.

Using this formulation, we generalize the baseline fixed reward definition of VPG [33] with the Exponential Reward Schedule (ExpRS). At time step $t$, this is

$\mathcal{R}_{\text{ExpRS}}(\phi_t) = b\, c^{W(\phi_t)}$   (4)

where $b$ and $c$ are chosen to normalize the rewards for numerical stability, and $W(\phi)$ is a design parameter associated with action type $\phi$. In this scheme, VPG’s fixed reward schedule can be represented with $b = 1/2$, $c = 2$, $W(\text{push}) = 0$, and $W(\text{grasp}) = 1$. Eq. 4 serves as a starting point for our SPOT reward below.
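A minimal sketch of ExpRS, using one parameterization that recovers VPG's fixed rewards of 0.5 per successful push and 1.0 per successful grasp (the function name is our own):

```python
def exp_rs_reward(action_type, W, b=0.5, c=2.0):
    """Exponential Reward Schedule: R = b * c ** W(action_type).

    b and c normalize the rewards; W maps each action type to a
    design parameter. The defaults are one illustrative setting.
    """
    return b * c ** W[action_type]

# This parameterization reproduces VPG's fixed schedule:
W = {"push": 0, "grasp": 1}
push_reward = exp_rs_reward("push", W)    # 0.5
grasp_reward = exp_rs_reward("grasp", W)  # 1.0
```

Raising $W$ for a harder action type exponentially increases its reward relative to easier actions.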

III-B Schedule for Positive Task (SPOT) Reward

Our Schedule for Positive Task (SPOT) reward operates on two principles: actions which advance overall task progress receive a reward proportional to the quantity of progress, but actions which reverse progress receive 0 reward. In our case, we quantify the task progress $P_t$ by recording stack height or row length, but many similar tasks have a naturally quantifiable measure. Thus, our reward schedule is a function of $P$ and the parameters associated with each action type:

$\mathcal{R}_P(P, s_t, s_{t+1}, a_t) = P_{t+1}\, \mathcal{R}_a(s_t, s_{t+1}, a_t)$   (5)

In our experiments, we choose a separate weight $W(\phi)$ for each of the push, grasp, and place action types.

Additionally, our SPOT reward function includes a mechanism to prevent catastrophic forgetting. As in our story of Spot the dog (Sec. I), we wish to minimize disincentive for exploration without rewarding mistakes. To accomplish this, we employ Situation Removal, which provides zero reward at time steps where progress $P$ was reversed:

$\mathcal{R}_{\text{SPOT}}(P, s_t, s_{t+1}, a_t) = \begin{cases} 0 & \text{if } P_{t+1} < P_t \\ \mathcal{R}_P(P, s_t, s_{t+1}, a_t) & \text{otherwise} \end{cases}$   (6)
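The two principles plus situation removal can be sketched as a single function. The weight values here are illustrative placeholders, not the paper's tuned parameters:

```python
def spot_reward(progress_prev, progress_new, action_type, W, success):
    """SPOT reward with situation removal.

    Returns 0 when the sub-task failed or progress was reversed
    (e.g. the stack got shorter); otherwise returns a reward
    proportional to the new progress, weighted per action type.
    """
    if not success or progress_new < progress_prev:
        return 0.0  # situation removal: no reward, no punishment
    return W[action_type] * progress_new

# Illustrative weights, not the paper's tuned values:
W = {"push": 0.1, "grasp": 1.0, "place": 1.0}

# Placing a block that grows the stack from 1 to 2 is rewarded;
# an action that topples the stack from 3 to 1 earns zero.
good = spot_reward(1, 2, "place", W, success=True)   # 2.0
bad = spot_reward(3, 1, "grasp", W, success=True)    # 0.0
```

Because reversals earn zero rather than negative reward, exploration near a partial stack is never actively punished.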
III-C SPOT Trial Reward

Fig. 4: Example of SPOT Trial Reward (eq. 7), and SPOT Reward (eq. 6) with images of key action steps. Actions 1-3: action 1 is an initial grasp, followed by a successful place where a stack of height 2 is formed. Actions 4-5: A grasp then place where the stack gets knocked over, so the reward values go to zero. Actions 11-14: Grasp and place actions lead to a full stack of 4 completing the trial. The final SPOT Trial Reward at action 14 is double the SPOT reward.

During experience replay of previously completed trials, we leverage future knowledge via a recursively defined reward function:

$\mathcal{R}_{\text{trial}}(P, s_t, s_{t+1}, a_t) = \begin{cases} \mathcal{R} & \text{if } \mathcal{R} = 0 \text{ or } t = N \\ \mathcal{R} + \gamma\, \mathcal{R}_{\text{trial}}(P, s_{t+1}, s_{t+2}, a_{t+1}) & \text{otherwise} \end{cases}$   (7)

where $N$ marks the end of the trial, and $\mathcal{R}$ can be $\mathcal{R}_a$, $\mathcal{R}_{\text{ExpRS}}$, or $\mathcal{R}_{\text{SPOT}}$, using eq. 3. The effect of using $\mathcal{R}_{\text{SPOT}}$ is that future rewards only propagate across time steps where sub-tasks are completed successfully, as shown in Fig. 4. Our model is trained on past trials with this reward during prioritized experience replay [27] with a future discount $\gamma$.
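The recursion in eq. 7 amounts to a single backward pass over a finished trial. This sketch assumes a scalar per-step SPOT reward and an illustrative discount value:

```python
def spot_trial_rewards(step_rewards, gamma=0.65):
    """Propagate future reward backward through a finished trial.

    Future reward flows across a time step only when that step's own
    reward is nonzero (the sub-task succeeded); a zero-reward step
    blocks propagation, per eq. 7. gamma is an assumed discount.
    """
    trial = [0.0] * len(step_rewards)
    future = 0.0
    for t in reversed(range(len(step_rewards))):
        r = step_rewards[t]
        if r == 0:
            trial[t] = 0.0
            future = 0.0  # situation removal blocks propagation
        else:
            trial[t] = r + gamma * future
            future = trial[t]
    return trial
```

With rewards [1, 2, 0, 1, 2] and γ = 0.5, the failed middle step stays at zero and decouples the two successful sub-sequences.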

III-D SPOT-Q Learning and Dynamic Action Spaces

Because of the high costs in robotics, it is desirable to utilize hardware time as efficiently as possible. We leverage a priori knowledge about the environment to make simple but powerful assumptions which both reduce unproductive attempts and accelerate training. Specifically, we assume that sub-task success requires physical interaction between the robot gripper and objects in the scene. That is, there exists a predictive mask function $M(s_t, a)$ which takes the current state $s_t$ and an action $a$ and returns 0 if contact is impossible and thus failure is certain, 1 otherwise. This is subtly different from the success indicator $\mathbb{1}_a$, which requires the outcome of an action to determine success or failure. Using $M$, we can define the dynamic action space $\mathcal{A}_M(s_t)$:

$\mathcal{A}_M(s_t) = \{ a \in \mathcal{A} \mid M(s_t, a) = 1 \}$   (8)
Importantly, $M$ does not tell us whether $a$ is an action worth taking but merely whether it is worth exploring. It could be a hand-tuned rule, as we show in Sec. IV, or the output of an instance segmentation model, e.g. [29].

With $\mathcal{A}_M$, we introduce our SPOT-Q Learning algorithm, which modifies Eq. 2 with a new target value based on the mask function:

$y_t = \mathcal{R}(s_t, s_{t+1}, a_t) + \gamma\, Q\big(s_{t+1}, \operatorname{argmax}_{a \in \mathcal{A}_M(s_{t+1})} Q(s_{t+1}, a)\big)$   (9)

where $\mathcal{A}_M(s_{t+1})$ is the dynamic action space of eq. 8.

Crucially, SPOT-Q Learning includes backpropagation on both the masked action, which fails implicitly, and the unmasked action from , which the robot actually performs. In Sec. IV, we discuss how this allows us to surpass prior work, wherein similar heuristics failed to accelerate learning [33].
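A sketch of the masked target computation, assuming flat arrays over the action space; the paper operates on pixel-wise Q maps, and the discount here is illustrative:

```python
import numpy as np

def spotq_target(reward, q_next, mask_next, gamma=0.5):
    """SPOT-Q target: restrict the argmax in eq. 2 to the dynamic
    action space, i.e. actions the mask marks as possibly productive.

    q_next, mask_next: flat arrays over the action space at s_{t+1};
    mask entries are 1 where gripper-object contact is possible.
    """
    masked_q = np.where(mask_next == 1, q_next, -np.inf)
    best = int(np.argmax(masked_q))
    return reward + gamma * q_next[best]

# The highest raw Q-value (0.9) sits on a masked-out action
# (e.g. grasping empty air), so the target uses the best
# *unmasked* action instead.
q_next = np.array([0.9, 0.5, 0.2])
mask_next = np.array([0, 1, 1])
y = spotq_target(1.0, q_next, mask_next, gamma=0.5)  # 1.0 + 0.5 * 0.5
```

During training, masked actions can additionally be regressed toward a reward of 0, which is how heuristic knowledge becomes extra supervision rather than a hard constraint.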

Simulation Test Task   | Source   | Reward Schedule | Action Space    | Scenarios 100% Complete | Trials Complete | Grasp | Action Efficiency
Clear 10 Toys          | VPG [33] | Exp (eq. 4)     | Standard        | 100%                    | 100%            | 68%   | 61%
Clear 10 Toys          | Ours     | SPOT (eq. 7)    | Standard        | 100%                    | 100%            | 83%   | 73%
Clear 10 Toys          | Ours     | SPOT (eq. 7)    | Dynamic (eq. 9) | 100%                    | 100%            | 84%   | 74%
Clear Toys Adversarial | VPG [33] | Exp (eq. 4)     | Standard        | 5/11                    | 84%             | 77%   | 60%
Clear Toys Adversarial | Ours     | SPOT (eq. 7)    | Standard        | 6/11                    | 94%             | 37%   | 36%
Clear Toys Adversarial | Ours     | SPOT (eq. 7)    | Dynamic (eq. 9) | 7/11                    | 95%             | 46%   | 38%
TABLE I: Pushing and grasping baseline simulation results. The first task is to clear 10 toys for 100 trials with random arrangements, and the second is to clear 10 trials each across 11 adversarial arrangements with 110 total trials. Bold entries highlight our key algorithm improvements over the baseline. We train for 5k total actions with an allowance permitting the final trial to complete.
Simulation Test Task | Reward Schedule | Action Space        | Trials Complete | Grasp | Place | Action Efficiency
Stack of 4 Cubes     | Exp (eq. 4)     | Standard            | 57%             | 29%   | 63%   | 29%
Stack of 4 Cubes     | SPOT (eq. 7)    | Dynamic, no SPOT-Q  | 91%             | 73%   | 64%   | 31%
Stack of 4 Cubes     | SPOT (eq. 7)    | Standard            | 97%             | 81%   | 79%   | 38%
Stack of 4 Cubes     | SPOT (eq. 7)    | Dynamic (eq. 9)     | 98%             | 81%   | 84%   | 57%
Row of 4 Cubes       | SPOT (eq. 7)    | Standard            | 92%             | 68%   | 61%   | 29%
Row of 4 Cubes       | SPOT (eq. 7)    | Dynamic (eq. 9)     | 93%             | 87%   | 85%   | 57%
TABLE II: Multi-Step task test success rates measured out of 100% for simulated tasks involving push, grasp and place actions trained for 10k actions. Bold entries highlight our key algorithm improvements over the baseline.
Real Test Task   | Training Domain | Reward Schedule | Action Space    | Trials Complete | Grasp | Place | Action Efficiency | Training Actions
Clear 20 Toys    | Real            | SPOT (eq. 7)    | Dynamic (eq. 9) | 1/1             | 75%   | -     | 75%               | 1k
Stack of 4 Cubes | Real            | SPOT (eq. 7)    | Dynamic (eq. 9) | 82%             | 71%   | 82%   | 60%               | 2.5k
Stack of 4 Cubes | Sim             | SPOT (eq. 7)    | Dynamic (eq. 9) | 90%             | 80%   | 80%   | 59%               | 10k
Row of 4 Cubes   | Sim             | SPOT (eq. 7)    | Dynamic (eq. 9) | 80%             | 68%   | 89%   | 71%               | 10k
TABLE III: Real robot task results. Bold entries highlight sim to real transfer.
Fig. 5: Real training progress on the Stack 4 Cubes task with SPOT-Q and the dynamic action space. Failures include missed grasps, off-stack placements, and actions in which the stack topples.
Fig. 6: The advantage of a dynamic action space with SPOT-Q is clear in this comparison of early simulated stacking progress.
Fig. 7: Grasp Success Rate, Trial Success Rate, and Action Efficiency improvement during simulated SPOT Rows Training. Higher is better. Notice the final 40% training trial success rate is much lower than the test rate of 93%. This is because situation removal is applied and the scene is reset during training when an action decreases the current row length, but is not a part of the final model testing.

IV Experiments

We conduct both simulated and real world experiments. Simulated experiments are run in the CoppeliaSim simulator (formerly V-REP), including baseline pushing and grasping scenarios provided by VPG [33]. We also evaluate our algorithm using two multi-step tasks of our own design: stacking and row-making. Tables I, II, and III summarize these results, with descriptions and analysis below.

IV-A Evaluation Metrics

We evaluate our algorithms in test cases with new random seeds and in accordance with the metrics found in VPG [33]. These include the percentage of successful grasps, placement action efficiency, and task completion rate. The completion rate is defined as the percentage of trials where the policy is able to successfully complete a task before the grasp or push action fails 10 consecutive times. A push is successful when more than 300 pixels have changed in the scene. A successful grasp is counted upon two consecutive detections by the internal Robotiq grasp sensor, once upon closing and once after lifting. A successful place for stacking is evaluated more specifically by the maximum height in the heightmap. Ideal Action Efficiency is 100% and is calculated as Ideal Action Count / Actual Action Count. The Ideal Action Count is 1 action per object for grasping tasks, and for tasks which involve placement it is 2 actions per object moved, where one object is assumed to remain stationary. This means 6 total actions for a stack of height 4, since only 3 objects must move, and 4 total actions for rows made by placing two blocks between two endpoints. We validate simulated results on 100 trials of novel random object positions.
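The efficiency metric reduces to a one-line helper (a hypothetical function of our own, parameterized by the number of objects that must actually move):

```python
def action_efficiency(objects_moved, actions_taken, needs_place=True):
    """Ideal Action Count / Actual Action Count.

    Grasp-only tasks need 1 action per moved object; placement tasks
    need 2 (a grasp plus a place) per moved object.
    """
    ideal = (2 if needs_place else 1) * objects_moved
    return ideal / actions_taken

# Stack of 4: three cubes move onto the base cube -> ideal 6 actions.
stack_eff = action_efficiency(3, 6)       # 1.0
# Row of 4: two cubes placed between two endpoints -> ideal 4 actions.
row_eff = action_efficiency(2, 4)         # 1.0
# Finishing the stack in 12 actions instead gives 50% efficiency.
slow_eff = action_efficiency(3, 12)       # 0.5
```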

IV-B Baseline Scenarios

Clear 10 Toys: We establish a baseline via the primary simulated experiment found in VPG [33], where 10 toys with varied shapes must be grasped to clear the robot workspace. Some differences occur because we also account for our asymmetric gripper, which means training cannot automatically be applied at both the actual gripper rotation angle and its 180° offset. Fig. 3 shows such considerations in detail. Additionally, our training performs experience replay in parallel with robot actions. Training with SPOT-Q results in a 5% increase in grasp successes and a 6% increase in action efficiency.

Table I (top) shows the results. As this is a fairly straightforward task, all methods are eventually able to complete it in the basic case. SPOT-Q training leads to the highest grasp success rate, though the difference is not dramatic. This makes intuitive sense, since the task structure is very simple.

Clear Toys Adversarial: The second baseline scenario contains 11 challenging adversarial arrangements from VPG [33] where toys are placed in tightly packed configurations. We use the pretrained weights from the “Clear 10 Toys” task in scenarios the algorithm has never previously seen. We validate on 10 trials for each of the 11 challenging arrangements, and our model was able to clear the scene in 104 out of 110 trials. Table I (bottom) details the results.

Curiously, while the rate of tasks completed rises, action efficiency is reduced in these challenging arrangements. Subjectively, this is due to the higher priority placed on grasping when compared to pushing as the SPOT-Q model attempts pushes in 10% of actions in these scenarios on average. In many cases the algorithm attempts a push only after several failed attempts at grasping, which finally frees up the blocks to complete the task, while in other cases it separates blocks using grasps alone.

IV-C Multi-Step Tasks: Stacks and Rows

We evaluate algorithm performance in several scenarios which we report in Table II.

Stack 4 Cubes: Our primary test task is to stack 4 cubes randomly placed within the scene. During training, we ensure that workspace constraints are strictly observed: any action that topples a partially assembled stack returns a reward of 0 and immediately ends the training trial as a failure. This strict progress evaluation criterion ensures the scores indicate an understanding of the context surrounding the stack.

In simulation we evaluate the basic exponential reward schedule (eq. 4); SPOT reward given in eq. 5, which accounts for task progress; and SPOT-Q (eq. 9), which dynamically restricts the action space and trains choices outside that action space with an assumed reward of 0. Table II shows the results. As evidenced by the huge difference between SPOT and the baseline reward schedule, the SPOT reward proves essential to task completion, succeeding 97% of the time, versus only 57% without it. SPOT-Q, in turn, provides a slightly better 98% completion rate with a notable increase in action efficiency from 38% to 57%.

We investigate whether SPOT-Q actually improves results or whether the benefits are simply a result of the dynamic action space heuristics, via the simulated “Dynamic Action Space, No SPOT-Q” test. It succeeds in just 91% of trials, compared to 97% of trials with a standard action space and 98% of trials with SPOT-Q. Early simulated training performance is shown in Fig. 6. Similarly, there is a large improvement in test action efficiency with SPOT-Q. This behavior is consistent with results in VPG [33], which found that a simple action heuristic led to worse performance, so we are able to conclude that SPOT-Q makes it feasible to learn from heuristic data.

Row of 4 Cubes: Our third test task evaluates the ability of the algorithm to generalize across tasks. Curiously, while making a row of 4 blocks appears similar to stacking, it is more difficult to train. In particular, whereas with stacking optimal placement occurs on top of a strong visual feature—another block—the arrangement of blocks in rows depends on non-local visual features, i.e. the rest of the row. Additionally, every block in each row is available for grasping, which may reverse progress, as opposed to stacks where only the top block is readily available. This requires significant understanding of context, as we have described it, to accomplish. Table II shows our algorithm’s performance on this difficult task, succeeding 93% of the time.

IV-D Real Robot Experiments

Finally, we show results on a real-world test with a Universal Robot UR5 equipped with a Robotiq 2-finger gripper, as described in prior work [25, 11]. For perception, we use an externally mounted Primesense Carmine RGB-D camera. We present three tasks and four real-world scenarios enumerated in Table III.

Real robot training for block stacking takes approximately 12 hours at 20 seconds per action. For sim to real transfer of stacks and rows, we run simulated training over 10,000 actions, also in real time at 15 seconds per action, which takes approximately 48 hours. At test time the real robot runs at a rate of 15 seconds per action; the difference from training is accounted for by experience replay running in a parallel Python thread. Unstacking during real world training and testing is largely automated by retaining a queue of past place locations which are grasped and randomly released in reverse order after a trial is complete. We periodically return objects which have left the robot workspace by hand, but a physical barrier like our bin in [11] would suffice. The bin was excluded for consistency with the VPG [33] baseline.
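The automated reset procedure can be sketched as a small bookkeeping class; the names and structure here are our own illustration of the described mechanism, not the authors' code:

```python
class Unstacker:
    """Automate scene reset by replaying past place locations in reverse.

    After a trial completes, grasp at the most recent place location and
    release at a random free spot, repeating until the history is empty.
    """
    def __init__(self):
        self.place_history = []

    def record_place(self, xyz):
        """Call after every successful place action."""
        self.place_history.append(xyz)

    def reset_plan(self, workspace_sampler):
        """Return (grasp_at, release_at) pairs, most recent place first.

        workspace_sampler: callable returning a random free location.
        """
        plan = []
        while self.place_history:
            grasp_at = self.place_history.pop()  # reverse order
            release_at = workspace_sampler()
            plan.append((grasp_at, release_at))
        return plan
```

Dismantling from the top down means each grasp target is always the exposed top of the stack, so the reset never requires extra perception.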

Real Stacking: We train from scratch on the real robot to perform the block stacking tasks, plotted in Fig. 5. This allows us to establish a real-world baseline for the SPOT-Q algorithm which we compare to the performance in simulation. Second, we compare this result against training in simulation and directly loading the final simulation model on the real robot without fine tuning.

Remarkably, our sim to real stacking model outperforms the real stacking model completing 90% of trials compared to 82%. This is particularly impressive, considering that our scene is exposed to variable sunlight. Intuitively, these results are in part due to the relevance of the depth heightmap in stacking and row-making.

Real Rows: We train in simulation and after loading the simulation based model are able to create rows in 80% of attempts, with every failure occurring due to our row checker erroneously terminating the trial early. We exclusively evaluate sim to real transfer in this case because training progress is significantly slower than with stacks.

Real Pushing and Grasping: We train the baseline pushing and grasping task from scratch in the real world, reaching a 75% grasp success rate and 75% efficiency within 1k actions, which is comparable to the performance VPG [33] charts over 2.5k actions. Sim to real transfer does not succeed in this task.

We expect that with minor optimizations in our code, a real robot could easily take an action once every 8-10 seconds. In future work a less simplistic action sequence could ensure the robot only moves completely out of the frame as needed, allowing it to act many times faster.

We expect that block based tasks are able to transfer because the network relies primarily on the depth images, which are more consistent between simulated and real data. This might reasonably explain why pushing and grasping does not transfer, a problem which could be mitigated in future work with methods like Domain Adaptation [28, 3].

V Conclusion

We have demonstrated that SPOT-Q is an effective approach for training long-horizon tasks, in particular for those where it is easy to reverse progress. To our knowledge, this is the first instance of reinforcement learning with successful sim to real transfer applied to multi-step tasks such as block-stacking and creating rows. SPOT quantifies an agent’s progress within multi-step tasks while also providing zero-reward guidance. In combination with the proposed dynamic action space and situation removal, it is able to quickly learn policies that generalize from simulation to the real world. We find these methods are necessary to achieve a 90% completion rate on the real block stacking task and an 80% completion rate on the row-making task1.

SPOT’s main limitation is that while intermediate rewards can be sparse, they are still necessary. Future research should look at ways of learning task structures which incorporate situation removal from data. In addition, the dynamic action space mask is currently manually designed; this and the lower-level open loop actions might be learned as well. Another topic for investigation is the difference underlying the successful sim to real transfer of stacking and row tasks when compared to pushing and grasping. Finally, in the future we would like to apply our method to more challenging tasks.


This material is based upon work supported by the NSF NRI Grant Award #1637949. Hongtao Wu’s contribution was funded under Office of Naval Research Award N00014-17-1-2124, Gregory S. Chirikjian, PI. We extend our thanks to Cori Hundt for the “Good Robot!” title copywriting, to those who provided editing and feedback, and to the VPG [33] authors for releasing their code.


  1. Video of the real robot stacking blocks and making rows with the sim to real models: https://youtu.be/QHNkghXCmY0


References

  1. J. Andreas, D. Klein and S. Levine (2016) Modular multitask reinforcement learning with policy sketches. arXiv preprint arXiv:1611.01796.
  2. Y. Aytar, T. Pfaff, D. Budden, T. Paine, Z. Wang and N. de Freitas (2018) Playing hard exploration games by watching YouTube. In Advances in Neural Information Processing Systems, pp. 2935–2945.
  3. K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor and K. Konolige (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250.
  4. A. Byravan and D. Fox (2016) SE3-Nets: learning rigid body motion using deep neural networks. arXiv preprint arXiv:1606.02378.
  5. C. Devin, P. Abbeel, T. Darrell and S. Levine (2018) Deep object-centric representations for generalizable robot learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7111–7118.
  6. C. Devin, A. Gupta, T. Darrell, P. Abbeel and S. Levine (2017) Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2169–2176.
  7. B. Drost, M. Ulrich, N. Navab and S. Ilic (2010) Model globally, match locally: efficient and robust 3D object recognition. In CVPR, Vol. 1, pp. 5.
  8. C. Finn, I. Goodfellow and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157.
  9. O. Groth, F. B. Fuchs, I. Posner and A. Vedaldi (2018) ShapeStacks: learning vision-based physical intuition for generalised object stacking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 702–717.
  10. A. Gupta, C. Eppner, S. Levine and P. Abbeel (2016) Learning dexterous manipulation for a soft robotic hand from human demonstrations. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3786–3793.
  11. A. Hundt, V. Jain, C. Lin, C. Paxton and G. D. Hager (2019) The CoSTAR block stacking dataset: learning with workspace constraints. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  12. E. Jang, S. Vijayanarasimhan, P. Pastor, J. Ibarz and S. Levine (2017) End-to-end learning of semantic grasping. In Conference on Robot Learning, pp. 119–132.
  13. D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan and V. Vanhoucke (2018) QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning (CoRL).
  14. O. Kroemer, S. Niekum and G. D. Konidaris (2019) A review of robot learning for manipulation: challenges, representations, and algorithms. arXiv preprint arXiv:1907.03146.
  15. S. Kumra and C. Kanan (2017) Robotic grasp detection using deep convolutional neural networks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  16. I. Lenz, H. Lee and A. Saxena (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Dataset: http://pr.cs.cornell.edu/grasping/rect
  17. A. Lerer, S. Gross and R. Fergus (2016) Learning physical intuition of block towers by example. In International Conference on Machine Learning, pp. 430–438.
  18. S. Levine, C. Finn, T. Darrell and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
  19. S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz and D. Quillen (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Dataset: https://sites.google.com/site/brainrobotdata/home
  20. J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea and K. Goldberg (2017) Dex-Net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems (RSS). Dataset: berkeleyautomation.github.io/dex-net/
  21. J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley and K. Goldberg (2019) Learning ambidextrous robot grasping policies. Science Robotics 4 (26).
  22. D. Morrison, J. Leitner and P. Corke (2018) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. In Robotics: Science and Systems XIV.
  23. A. Mousavian, C. Eppner and D. Fox (2019) 6-DOF GraspNet: variational grasp generation for object manipulation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2901–2910.
  24. A. Murali, A. Mousavian, C. Eppner, C. Paxton and D. Fox (2019) 6-DOF grasping for target-driven object manipulation in clutter. ICRA 2020, to appear.
  25. C. Paxton, F. Jonathan, A. Hundt, B. Mutlu and G. D. Hager (2018) Evaluating methods for end-user creation of robot task plans. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6086–6092.
  26. J. Redmon and A. Angelova (2014) Real-time grasp detection using convolutional neural networks. CoRR abs/1412.3128.
  27. T. Schaul, J. Quan, I. Antonoglou and D. Silver (2016) Prioritized experience replay. In International Conference on Learning Representations (ICLR).
  28. J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30.
  29. C. Xie, Y. Xiang, A. Mousavian and D. Fox (2019) The best of both modes: separately leveraging RGB and depth for unseen object instance segmentation. In Conference on Robot Learning (CoRL).
  30. D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei and S. Savarese (2018) Neural task programming: learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
  31. K. Zakka, A. Zeng, J. Lee and S. Song (2019) Form2Fit: learning shape priors for generalizable assembly from disassembly. In Proceedings of the IEEE International Conference on Robotics and Automation.
  32. A. Zeng, S. Song, J. Lee, A. Rodriguez and T. Funkhouser (2019) TossingBot: learning to throw arbitrary objects with residual physics. arXiv preprint arXiv:1903.11239.
  33. A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez and T. Funkhouser (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245.
  34. F. Zhang, J. Leitner, M. Milford and P. Corke (2016) Modular deep Q networks for sim-to-real transfer of visuo-motor policies. In Australasian Conference on Robotics and Automation (ACRA).
  35. H. Zhang, X. Zhou, X. Lan, J. Li, Z. Tian and N. Zheng (2018) A real-time robotic grasp approach with oriented anchor box. arXiv preprint arXiv:1809.03873.
  36. Y. Zhu, Z. Wang, J. Merel, A. A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas and N. Heess (2018) Reinforcement and imitation learning for diverse visuomotor skills. In Robotics: Science and Systems XIV.