“Good Robot!”: Efficient Reinforcement Learning for Multi-Step Visual Tasks with Sim to Real Transfer
In order to effectively learn multi-step tasks, robots must be able to understand the context by which task progress is defined. In reinforcement learning, much of this information is provided to the learner by the reward function. However, comparatively little work has examined how the reward function captures—or fails to capture—task context in robotics, particularly in long-horizon tasks where failure is highly consequential.
To address this issue, we describe the Schedule for Positive Task (SPOT) Reward and the SPOT-Q reinforcement learning algorithm, which efficiently learn multi-step block manipulation tasks in both simulation and real-world environments. SPOT-Q is remarkably effective compared to past benchmarks. It successfully completes simulated trials of a variety of tasks, including stacking cubes (98%), clearing toys arranged in random (100%) and adversarial (95%) patterns by pushing and grasping, and creating rows of cubes (93%). Furthermore, we demonstrate direct sim to real transfer: by directly loading the simulation-trained model on the real robot, we are able to create real stacks in 90% of trials and rows in 80% of trials with no additional real-world fine-tuning. Our system is also quite efficient, with models training within 1-10k actions depending on the task. As a result, our algorithm makes learning complex, multi-step tasks both efficient and practical for real-world manipulation. Code is available at https://github.com/jhu-lcsr/good_robot.
Multi-step tasks in real-world settings are challenging to learn. Such tasks intertwine learning the immediate physical consequences of actions with the need to understand how these consequences affect progress toward the overall task goal. Furthermore, in contrast to traditional motion planning which assumes perfect information and known action models, learning only has access to the spatially and temporally limited information that can be gained from sensing the environment.
We describe the Schedule for Positive Task Q (SPOT-Q) learning algorithm, an approach for efficiently making use of prior spatial and temporal information within an image-based deep reinforcement learning (DRL) algorithm. Our key observation is that significant time is wasted during reinforcement learning on exploring actions which are not productive. For example, in a block stacking task like that shown in Fig. 1, the knowledge that grasping at empty air will never snag an object is "common sense" to humans. However, a vanilla learning algorithm has to discover that attempting a grasp in open space is unproductive. By contrast, SPOT-Q is able to incorporate such common sense constraints, which, as we show in both simulation and real-world testing, significantly accelerates both learning and final task efficiency. Our models are thus able to focus on learning core concepts. For example, grasping an object from a stack does not contribute to making the stack higher; the robot should focus on lone, unstacked objects.
While these types of constraints are intuitive, incorporating them into DRL in a manner that leads to reliable and efficient learning is nontrivial [14, 33]. Our SPOT reward schedule (Sec. III-B) takes inspiration from a humane and effective approach to training pets sometimes called "Positive Conditioning." Consider the goal of training a dog "Spot" to ignore an object or event she finds particularly interesting on command. In this practice, Spot is rewarded with treats whenever partial compliance with the desired end behavior is shown, and is simply removed from harmful or regressive situations with zero treats (reward). One way to achieve this is to start with multiple treats in hand, place one treat in view of Spot, and, if she eagerly jumps at the treat (a negative action), snatch and hide the treat immediately for zero reward on that action. With repetition, Spot will eventually hesitate, and so she is immediately praised with "Good Spot!" and gets a treat separate from the one she should ignore. This approach can be expanded to new situations and behaviors, and it encourages exploration and rapid improvement once an initial partial success is achieved. As we describe in Sec. III, our SPOT reward, SPOT Trial Reward, and SPOT-Q Learning are likewise designed to provide neither reward nor punishment for actions which reverse progress.
We outperform the state of the art in simulation, train efficiently in real-world settings, and demonstrate direct sim to real transfer on two multi-step tasks. In summary, our contributions in this article are: (1) reinforcement learning of multi-step robotic tasks combining model-based low-level control with model-free learning of high-level goals, including the progressive SPOT reward and SPOT Trial reward schedules; (2) the SPOT-Q reinforcement learning algorithm with dynamic action space situation removal; and (3) experiments showing state-of-the-art performance and domain transfer for models trained using SPOT-Q.
II Related Work
Deep Neural Networks (DNNs) have enabled the use of raw sensor data in robotic manipulation [18, 19, 13, 33, 14]. In some approaches, a DNN's output directly corresponds to motor commands, e.g. [18, 19]. Higher-level methods, on the other hand, assume a simple model for robotic control and focus on bounding box or pose detection for downstream grasp planning [26, 35, 7, 15, 12, 16, 22, 33]. Increasingly, these methods benefit from the depth information provided by RGB-D sensors [33, 22, 24], which capture physical information about the workspace. However, the agent must still develop physical intuition, which recent work attempts in more targeted settings: [17, 9] focus on block stacking by classifying simulated stacks as stable or likely to fall. Of these, the ShapeStacks dataset includes a larger variety of objects, such as cylinders and spheres, as opposed to blocks alone. Similarly, [8, 4] develop physical intuition by predicting push action outcomes. Our work diverges from these approaches by developing visual understanding and physical intuition simultaneously, in concert with understanding progress in multi-step tasks. Object-centric skill learning can more quickly learn skills that generalize better, e.g. [10, 5].
Grasping is a particularly active area of research. DexNet [20, 21] learns from a large number of depth images of top-down grasps and achieves excellent performance on grasping novel objects, but does not address long-horizon tasks. 6-DOF GraspNet uses large amounts of simulated grasp data to generalize to new objects, and has been extended to handle reliable grasping of novel objects in clutter.
Deep reinforcement learning has proven effective at increasingly complex tasks in robotic manipulation [34, 33, 32, 13]. QT-Opt  learns manipulation skills from hundreds of thousands of real-world grasp attempts on real robots. Other methods focus on transferring visuomotor skills from simulated to real robots [34, 36]. Our work directs a low-level controller to perform actions rather than regressing torque vectors directly, following prior work [33, 32] by learning a pixel-wise success likelihood map.
Domain adaptation refers to methods for generalizing an algorithm trained in one domain to another. In robotics, it is common to apply random textures to training images in simulation so that models transfer to the real world [28, 3].
We compare our approach to VPG, a recently developed method for learning RL-based table clearing tasks from images which can be trained within hours on a single robot. It frequently completes adversarial scenarios like those in Fig. 7 by first pushing a tightly packed group and then grasping the now-separated objects. VPG assumes an instantaneous reward is sufficient to complete the task at hand. Following VPG, Form2Fit tackles kitting tasks with well defined containers within which objects should be placed, but does not investigate reinforcement learning. By contrast, we tackle multi-step tasks with sparse rewards, no visible pattern representing a goal, and stacks which are simultaneously a goal and an obstacle. These tasks cannot be represented by either VPG or Form2Fit.
Multi-step tasks with sparse rewards present a particular challenge in reinforcement learning because solutions are less likely to be discovered through random exploration. When available, demonstration can be an effective method for guiding exploration [30, 2, 11]. Within robotic manipulation specifically, one approach separates multi-step tasks into modular sub-tasks comprising a sketch, while another separates the learning architecture into robot- and task-specific modules. Our SPOT reward, meanwhile, combines a novel reward schedule with a dynamic action space to make early successes both more likely and more instructive.
We investigate the general problem of assembly through vision-based robotic manipulation. As in past work, we make the simplifying assumption that the sensor observations available at any time embed all necessary state information and thus provide sufficient information about the environment to choose correct actions, effectively equating sensor observations and state. We can then frame the problem as a Markov Decision Process (S, A, T, R), with state space S, action space A, transition probability function T, and reward function R. At time step t, the agent observes state s_t and chooses an action a_t according to its policy π(s_t). The action results in a new state s_{t+1} with probability T(s_{t+1} | s_t, a_t). As in VPG, we use Q-learning to produce a deterministic policy for choosing actions. The function Q(s_t, a_t) estimates the expected reward of an action from a given state, i.e. the "quality" of an action, which determines the policy:

π(s_t) = argmax_a Q(s_t, a). (1)
Thus, the goal of training is to learn a Q function that maximizes reward over time. This is accomplished by iteratively minimizing |Q(s_t, a_t) − y_t|, where the target value y_t is:

y_t = r_t + γ · max_a Q(s_{t+1}, a). (2)
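As a concrete illustration, the target value and the error being minimized can be sketched in a few lines (a minimal sketch with illustrative names; the actual implementation operates on pixel-wise Q maps):

```python
import numpy as np

def q_target(reward, q_next, gamma=0.9):
    """Q-learning target: y_t = r_t + gamma * max_a Q(s_{t+1}, a)."""
    return reward + gamma * np.max(q_next)

def td_error(q_sa, reward, q_next, gamma=0.9):
    """Temporal-difference error |Q(s_t, a_t) - y_t| minimized during training."""
    return abs(q_sa - q_target(reward, q_next, gamma))
```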
In our experiments, we consider a robotic agent capable of being commanded to a specified arm pose and gripper state in its workspace. The agent observes the environment via a fixed RGB-D camera, whose view we project so that it is aligned with the direction of gravity, as shown in Fig. 2. As per VPG, our height map covers a square workspace 0.448 m on each side. The projected heightmap color and depth images each have a resolution of 224 × 224, meaning that each pixel represents roughly 2 mm × 2 mm of the workspace.
Our action space consists of three components: action types φ, locations (x, y), and angles θ. First, the set of action types Φ consists of three high-level motion primitives: push, grasp, and place. Second, we discretize the spatial action space into 224 × 224 bins with coordinates (x, y). The angle space is similarly discretized into 16 bins.
A traditional trajectory planner executes each action on the robot. For grasping and placing, each action moves to (x, y) with gripper angle θ and closes or opens the gripper, respectively. A push starts with the gripper closed at (x, y) and moves horizontally a fixed distance along angle θ. Fig. 2 visualizes our overall algorithm, including the action space and corresponding Q-values.
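A minimal sketch of how an action can be selected from such pixel-wise Q maps, assuming one Q value per primitive, rotation bin, and spatial bin (function names and map sizes here are illustrative, not the released implementation):

```python
import numpy as np

NUM_ROTATIONS = 16  # assumed angle discretization

def best_action(q_maps):
    """q_maps: dict mapping each primitive name ("push", "grasp", "place")
    to an array of shape (num_rotations, height, width) of Q values.
    Returns the argmax action as (primitive, rotation_bin, y, x)."""
    best_value, best = -np.inf, None
    for primitive, q in q_maps.items():
        idx = np.unravel_index(np.argmax(q), q.shape)
        if q[idx] > best_value:
            best_value = q[idx]
            best = (primitive,) + tuple(int(i) for i in idx)
    return best

def rotation_angle(rotation_bin):
    """Map a rotation bin index to a gripper angle in degrees."""
    return rotation_bin * (360.0 / NUM_ROTATIONS)
```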
III-A Reward Shaping
We focus on learning neural network models which can solve multi-stage tasks with strong contextual and temporal dependencies between stages. Reward shaping is an effective technique for training such networks; in our case it involves rewarding intermediate sub-tasks as well as the overall task. Here we present a general, action-type-based formulation of reward shaping which adapts readily to novel tasks and reduces the ad hoc nature of designing successful reward schedules.
Suppose we have a sub-task indicator function 1[a_t] that equals 1 if action a_t succeeds at a sub-task and 0 otherwise. Examples of indicator sources include human supervision, another algorithm, or the grasp detector built into our Robotiq 2F85 gripper. In our experiments, a push action is successful if it perturbs an object, whereas a place action is successful only if it increases the stack height. Our reward schedule is a function which determines the appropriate reward for each successful action type:

R(φ_t, a_t) = W(φ_t) · 1[a_t], (3)

where W assigns a weight to each action type φ.
This is a desirable formulation because it allows the reward function to apply different weights to different action types. For example, successfully utilizing place actions is more complex than push actions, because a successful place depends on the actions which precede it.
Using this formulation, we generalize the baseline fixed reward definition of VPG with the Exponential Reward Schedule (ExpRS). At time step t, this is

R_ExpRS(φ_t, a_t) = b · a^{p_{φ_t}} · 1[a_t], (4)

where a and b are chosen to normalize the rewards for numerical stability, and p_φ is a design parameter associated with action type φ. In this scheme, VPG's fixed reward schedule of 0.5 for a successful push and 1 for a successful grasp can be represented with a = 2, b = 1/2, p_push = 0, and p_grasp = 1. Eq. 4 serves as a starting point for our SPOT reward below.
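As an illustration, the following sketch reproduces VPG's fixed rewards (0.5 for a successful push, 1 for a successful grasp) with an exponential schedule of this form; the parameter names are ours:

```python
def exprs_reward(action_type, success, a=2.0, b=0.5, powers=None):
    """Exponential reward schedule sketch: b * a**p_phi for successful actions.
    Defaults reproduce VPG's fixed schedule (push: 0.5, grasp: 1.0)."""
    if powers is None:
        powers = {"push": 0, "grasp": 1}  # illustrative design parameters p_phi
    if not success:
        return 0.0
    return b * a ** powers[action_type]
```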
III-B Schedule for Positive Task (SPOT) Reward
Our Schedule for Positive Task (SPOT) reward operates on two principles: actions which advance overall task progress receive a reward proportional to the quantity of progress, while actions which reverse progress receive 0 reward. In our case, we quantify the task progress P_t by recording stack height or row length, but many similar tasks have a naturally quantifiable measure. Thus, our reward schedule is a function of P_t and the weight W_φ associated with each action type:

R_P(φ_t, P_t) = W_{φ_t} · P_t · 1[a_t]. (5)
In our experiments, we use fixed values for W_push, W_grasp, and W_place.
Additionally, our SPOT reward function includes a mechanism to prevent catastrophic forgetting. As in our story of Spot the dog (Sec. I), we wish to minimize the disincentive for exploration without rewarding mistakes. To accomplish this, we employ Situation Removal, which provides zero reward at time steps where progress was reversed:

SR(R, P_t) = 0 if P_t < P_{t−1}, and R otherwise. (6)

Applying situation removal to eq. 5 yields the full SPOT reward:

R_SPOT(φ_t, P_t) = SR(R_P(φ_t, P_t), P_t). (7)
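Putting the two principles together with situation removal, the SPOT reward can be sketched as follows (the weight values below are placeholders, not our tuned constants):

```python
def spot_reward(action_type, success, progress, prev_progress, weights=None):
    """SPOT reward sketch: progress-proportional reward, with zero reward for
    failed actions and for actions that reverse progress (situation removal)."""
    if weights is None:
        weights = {"push": 0.1, "grasp": 1.0, "place": 1.0}  # placeholder weights
    if not success or progress < prev_progress:
        return 0.0  # situation removal: no reward, no punishment
    return weights[action_type] * progress
```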
III-C SPOT Trial Reward
During experience replay of previously completed trials, we leverage future knowledge via a recursively defined reward function:

R_trial(s_t, a_t) = R(s_t, a_t) if t = N, and R(s_t, a_t) + γ · R_trial(s_{t+1}, a_{t+1}) · 1[R(s_t, a_t) > 0] otherwise, (8)

where N marks the end of the trial, and R can depend on 1[a_t], φ_t, or P_t, using eq. 3. The effect of using R_trial is that future rewards only propagate across time steps where subtasks are completed successfully, as shown in Fig. 4. Our model is trained on past trials with this reward during prioritized experience replay with a constant discount factor γ.
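The backward pass over a completed trial can be sketched as follows; note how a zero subtask reward blocks future reward from propagating to earlier time steps (the discount value in the test is arbitrary):

```python
def spot_trial_rewards(rewards, gamma):
    """Recursively propagate future reward backward through one finished trial.
    Future reward only flows across time steps whose per-step reward is
    nonzero, so failed subtasks cut off propagation."""
    out = [0.0] * len(rewards)
    future = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        if rewards[t] > 0:
            out[t] = rewards[t] + gamma * future
            future = out[t]
        else:
            out[t] = 0.0
            future = 0.0  # a failed subtask blocks propagation
    return out
```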
III-D SPOT-Q Learning and Dynamic Action Spaces
Because of the high cost of robot time, it is desirable to utilize hardware as efficiently as possible. We leverage a priori knowledge about the environment to make simple but powerful assumptions which both reduce unproductive attempts and accelerate training. Specifically, we assume that sub-task success requires physical interaction between the robot gripper and objects in the scene. That is, there exists a predictive mask function M(s_t, a) ∈ {0, 1} which takes the current state s_t and a proposed action a and returns 0 if contact is impossible and thus failure is certain, and 1 otherwise. This is subtly different from the success indicator 1[a_t], which requires the outcome of an action to determine success or failure. Using M, we can define the dynamic action space A_M(s_t):

A_M(s_t) = {a ∈ A | M(s_t, a) = 1}. (9)
Importantly, M does not tell us whether a is an action worth taking but merely whether it is worth exploring. It could be a hand-tuned rule, as we show in Sec. IV, or the output of an instance segmentation model.
With A_M, we introduce our SPOT-Q Learning algorithm, which modifies eq. 2 with a new target value based on the mask function:

y_t = r_t + γ · max_{a ∈ A_M(s_{t+1})} Q(s_{t+1}, a). (10)
Crucially, SPOT-Q Learning includes backpropagation on both the masked actions, which fail implicitly and are trained toward zero reward, and the unmasked action from A_M(s_t), which the robot actually performs. In Sec. IV, we discuss how this allows us to surpass prior work, wherein similar heuristics failed to accelerate learning.
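Over a toy flattened action space, the SPOT-Q target and the implicit zero-reward training of masked actions can be sketched as follows (all names are illustrative):

```python
import numpy as np

def spotq_target(reward, q_next, mask_next, gamma=0.9):
    """SPOT-Q target sketch: the max over next-state actions is restricted
    to the dynamic action space, i.e. actions whose mask value is 1."""
    masked_q = np.where(mask_next == 1, q_next, -np.inf)
    return reward + gamma * np.max(masked_q)

def masked_training_targets(mask):
    """Masked actions fail implicitly: return (action_index, target=0.0)
    pairs to backpropagate alongside the executed action."""
    return [(i, 0.0) for i, m in enumerate(mask) if m == 0]
```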
Table I: Simulated results on the baseline toy clearing tasks.

| Test Task | Method | Reward Schedule | Action Space | 100% Complete | Completion | Grasp Success | Action Efficiency |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Clear 10 Toys | VPG | Exp (eq. 4) | Standard | 100% | 100% | 68% | 61% |
| Clear 10 Toys | Ours | SPOT (eq. 7) | Standard | 100% | 100% | 83% | 73% |
| Clear 10 Toys | Ours | SPOT (eq. 7) | Dynamic (eq. 9) | 100% | 100% | 84% | 74% |
| Clear Toys Adversarial | VPG | Exp (eq. 4) | Standard | 5/11 | 84% | 77% | 60% |
| Clear Toys Adversarial | Ours | SPOT (eq. 7) | Standard | 6/11 | 94% | 37% | 36% |
| Clear Toys Adversarial | Ours | SPOT (eq. 7) | Dynamic (eq. 9) | 7/11 | 95% | 46% | 38% |
Table II: Simulated results on multi-step tasks over 100 trials of novel random object positions.

| Test Task | Reward Schedule | Action Space | Completion | Grasp Success | Place Success | Action Efficiency |
| --- | --- | --- | --- | --- | --- | --- |
| Stack of 4 Cubes | Exp (eq. 4) | Standard | 57% | 29% | 63% | 29% |
| Stack of 4 Cubes | SPOT (eq. 7) | Dynamic, no SPOT-Q | 91% | 73% | 64% | 31% |
| Stack of 4 Cubes | SPOT (eq. 7) | Standard | 97% | 81% | 79% | 38% |
| Stack of 4 Cubes | SPOT (eq. 7) | Dynamic (eq. 9) | 98% | 81% | 84% | 57% |
| Row of 4 Cubes | SPOT (eq. 7) | Standard | 92% | 68% | 61% | 29% |
| Row of 4 Cubes | SPOT (eq. 7) | Dynamic (eq. 9) | 93% | 87% | 85% | 57% |
Table III: Real robot results; the Training column indicates whether the model was trained on the real robot or purely in simulation (sim to real transfer).

| Test Task | Training | Reward Schedule | Action Space | Completion | Grasp Success | Place Success | Action Efficiency | Training Actions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Clear 20 Toys | Real | SPOT (eq. 7) | Dynamic (eq. 9) | 1/1 | 75% | - | 75% | 1k |
| Stack of 4 Cubes | Real | SPOT (eq. 7) | Dynamic (eq. 9) | 82% | 71% | 82% | 60% | 2.5k |
| Stack of 4 Cubes | Sim | SPOT (eq. 7) | Dynamic (eq. 9) | 90% | 80% | 80% | 59% | 10k |
| Row of 4 Cubes | Sim | SPOT (eq. 7) | Dynamic (eq. 9) | 80% | 68% | 89% | 71% | 10k |
We conduct both simulated and real-world experiments. Simulated experiments are run in the CoppeliaSim simulator (formerly V-REP), including the baseline pushing and grasping scenarios provided by VPG. We also evaluate our algorithm on two multi-step tasks of our own design: stacking and row-making. Tables I, II, and III summarize these results, with descriptions and analysis below.
IV-A Evaluation Metrics
We evaluate our algorithms on test cases with new random seeds, in accordance with the metrics found in VPG. These include the percentage of successful grasps, the placement success rate, action efficiency, and the task completion rate. The completion rate is the percentage of trials in which the policy successfully completes the task before a grasp or push action fails 10 consecutive times. A push is successful when more than 300 pixels have changed in the scene. A grasp is counted as successful upon two consecutive detections by the internal Robotiq grasp sensor, once upon closing and once after lifting. A successful place for stacking is evaluated more specifically by the maximum height in the heightmap. Ideal Action Efficiency is 100% and is calculated as Ideal Action Count / Total Actions Taken. The Ideal Action Count is 1 action per object for grasping tasks; for tasks which involve placement it is 2 actions per object, where one object is assumed to remain stationary. This means 6 total actions for a stack of height 4, since only 3 objects must move, and 4 total actions for rows made by placing two blocks between two endpoints. We validate simulated results on 100 trials of novel random object positions.
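The efficiency metric reduces to a one-line computation; the sketch below also encodes the ideal action counts described above:

```python
def ideal_action_count(objects_to_move, needs_placement=True):
    """1 action per object for grasp-only tasks, 2 (grasp + place) otherwise."""
    return (2 if needs_placement else 1) * objects_to_move

def action_efficiency(ideal_actions, actions_taken):
    """Ideal Action Count / actions actually taken; 1.0 (100%) is ideal."""
    return ideal_actions / actions_taken
```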
IV-B Baseline Scenarios
Clear 10 Toys: We establish a baseline via the primary simulated experiment found in VPG, where 10 toys with varied shapes must be grasped to clear the robot workspace. Some differences occur because we also account for our asymmetric gripper, which means training cannot automatically be applied at both the actual gripper rotation angle and its 180° offset. Fig. 3 shows such considerations in detail. Additionally, our training performs experience replay in parallel with robot actions. Training with SPOT-Q results in a 5% increase in grasp successes and a 6% increase in action efficiency.
Table I (top) shows the results. As this is a fairly straightforward task, all methods are eventually able to complete it in the basic case. SPOT-Q training leads to the highest grasp success rate, though not by a dramatic margin; this makes intuitive sense, since the task structure is very simple.
Clear Toys Adversarial: The second baseline scenario contains 11 challenging adversarial arrangements from VPG  where toys are placed in tightly packed configurations. We use the pretrained weights from the “Clear 10 Toys” task in scenarios the algorithm has never previously seen. We validate on 10 trials for each of the 11 challenging arrangements, and our model was able to clear the scene in 104 out of 110 trials. Table I (bottom) details the results.
Curiously, while the rate of tasks completed rises, action efficiency is reduced in these challenging arrangements. Subjectively, this is due to the higher priority placed on grasping compared to pushing: the SPOT-Q model attempts pushes in only 10% of actions in these scenarios on average. In many cases the algorithm attempts a push only after several failed attempts at grasping, which finally frees up the blocks to complete the task, while in other cases it separates blocks using grasps alone.
IV-C Multi-Step Tasks: Stacks and Rows
We evaluate algorithm performance in several scenarios which we report in Table II.
Stack 4 Cubes: Our primary test task is to stack 4 cubes randomly placed within the scene. During training, we strictly enforce workspace constraints: any action that topples a partially assembled stack returns a reward of 0 and immediately ends the training trial with a failure. This strict progress evaluation criterion ensures the scores indicate an understanding of the context surrounding the stack.
In simulation we evaluate the basic exponential reward schedule (eq. 4); the SPOT reward given in eq. 5, which accounts for task progress; and SPOT-Q (eq. 9), which dynamically restricts the action space and trains choices outside that action space with an assumed reward of 0. Table II shows the results. As evidenced by the large gap between SPOT and the baseline reward schedule, the SPOT reward proves essential to task completion, succeeding 97% of the time versus only 57% without it. SPOT-Q, in turn, provides a slightly better 98% completion rate with a notable increase in action efficiency, from 38% to 57%.
We investigate whether SPOT-Q actually improves results, or whether the benefits are simply a result of the dynamic action space heuristics, via the simulated "Dynamic Action Space, No SPOT-Q" test. It succeeds in just 91% of trials, compared to 97% with a standard action space and 98% with SPOT-Q; early training performance in simulation is shown in Fig. 6. Similarly, there is a large improvement in test action efficiency with SPOT-Q. This behavior is consistent with results in VPG, which found that a simple action heuristic led to worse performance, so we conclude that SPOT-Q makes it feasible to learn from heuristic data.
Row of 4 Cubes: Our third test task evaluates the ability of the algorithm to generalize across tasks. Curiously, while making a row of 4 blocks appears similar to stacking, it is more difficult to train. In particular, whereas with stacking optimal placement occurs on top of a strong visual feature—another block—the arrangement of blocks in rows depends on non-local visual features, i.e. the rest of the row. Additionally, every block in a row is available for grasping, which may reverse progress, as opposed to stacks where only the top block is readily available. Accomplishing this task thus requires a significant understanding of context, as we have described it. Table II shows our algorithm's performance on this difficult task, where it succeeds 93% of the time.
IV-D Real Robot Experiments
Finally, we show results of a real-world test with a Universal Robots UR5 equipped with a Robotiq 2-finger gripper, as described in prior work [25, 11]. For perception, we use an externally mounted Primesense Carmine RGB-D camera. We present three tasks and four real-world scenarios, enumerated in Table III.
Real robot training for block stacking takes approximately 12 hours at 20 seconds per action. We run simulated training for sim to real transfer of stacks and rows over 10,000 actions, also in real time at 15 seconds per action, which takes approximately 48 hours. At test time the real robot runs at a rate of 15 seconds per action; the difference is accounted for by experience replay running in a parallel Python thread. Unstacking during real-world training and testing is largely automated by retaining a queue of past place locations, which are grasped and randomly released in reverse order after a trial is complete. We periodically return objects which have left the robot workspace by hand, but a physical barrier like the bin in our prior work would suffice; the bin was excluded for consistency with the VPG baseline.
Real Stacking: We train from scratch on the real robot to perform the block stacking task, plotted in Fig. 5. This establishes a real-world baseline for the SPOT-Q algorithm, which we compare to performance in simulation. We then compare this result against training in simulation and directly loading the final simulation model on the real robot without fine-tuning.
Remarkably, our sim to real stacking model outperforms the real stacking model, completing 90% of trials compared to 82%. This is particularly impressive considering that our scene is exposed to variable sunlight. Intuitively, these results are in part due to the relevance of the depth heightmap in stacking and row-making.
Real Rows: We train in simulation and after loading the simulation based model are able to create rows in 80% of attempts, with every failure occurring due to our row checker erroneously terminating the trial early. We exclusively evaluate sim to real transfer in this case because training progress is significantly slower than with stacks.
Real Pushing and Grasping: We train the baseline pushing and grasping task from scratch in the real world, reaching a 75% grasp success rate and 75% action efficiency within 1k actions, which is comparable to the performance VPG charts over 2.5k actions. Sim to real transfer does not succeed in this task.
We expect that with minor optimizations to our code, a real robot could easily take an action once every 8-10 seconds. In future work, a less simplistic action sequence could ensure the robot only moves completely out of the camera frame when needed, allowing it to act many times faster.
We expect that block based tasks are able to transfer because the network relies primarily on the depth images, which are more consistent between simulated and real data. This might reasonably explain why pushing and grasping does not transfer, a problem which could be mitigated in future work with methods like Domain Adaptation [28, 3].
We have demonstrated that SPOT-Q is an effective approach for training long-horizon tasks, in particular for those where it is easy to reverse progress.
To our knowledge, this is the first instance of reinforcement learning with successful sim to real transfer applied to multi-step tasks such as block-stacking and creating rows.
SPOT quantifies an agent’s progress within multi-step tasks while also providing zero-reward guidance.
In combination with the proposed dynamic action space and situation removal, it is able to quickly learn policies that generalize from simulation to the real world.
We find these methods are necessary to achieve a 90% completion rate on the real block stacking task and an 80% completion rate on the row-making task.
SPOT's main limitation is that while intermediate rewards can be sparse, they are still necessary. Future research should look at ways of learning task structures which incorporate situation removal from data. In addition, the dynamic action space mask is currently manually designed; this and the lower-level open-loop actions might be learned as well. Another topic for investigation is the difference underlying successful sim to real transfer of the stacking and row tasks when compared to pushing and grasping. Finally, in the future we would like to apply our method to more challenging tasks.
This material is based upon work supported by the NSF NRI Grant Award #1637949. Hongtao Wu’s contribution was funded under Office of Naval Research Award N00014-17-1-2124, Gregory S. Chirikjian, PI. We extend our thanks to Cori Hundt for the “Good Robot!” title copywriting, to those who provided editing and feedback, and to the VPG  authors for releasing their code.
- Video of the real robot stacking blocks and making rows with the sim to real models: https://youtu.be/QHNkghXCmY0
- (2016) Modular multitask reinforcement learning with policy sketches. arXiv e-prints.
- (2018) Playing hard exploration games by watching YouTube. In Advances in Neural Information Processing Systems, pp. 2935–2945.
- (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250.
- (2016) SE3-nets: learning rigid body motion using deep neural networks. arXiv preprint arXiv:1606.02378. https://arxiv.org/abs/1606.02378
- (2018) Deep object-centric representations for generalizable robot learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7111–7118.
- (2017) Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2169–2176.
- (2010) Model globally, match locally: efficient and robust 3D object recognition. In CVPR, Vol. 1, pp. 5.
- (2016) Unsupervised learning for physical interaction through video prediction. arXiv e-prints.
- (2018) ShapeStacks: learning vision-based physical intuition for generalised object stacking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 702–717.
- (2016) Learning dexterous manipulation for a soft robotic hand from human demonstrations. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3786–3793.
- (2019) The CoSTAR block stacking dataset: learning with workspace constraints. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- (2017) End-to-end learning of semantic grasping. In Conference on Robot Learning, pp. 119–132.
- (2018) QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. CoRL.
- (2019) A review of robot learning for manipulation: challenges, representations, and algorithms. arXiv preprint arXiv:1907.03146.
- (2017) Robotic grasp detection using deep convolutional neural networks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Dataset: http://pr.cs.cornell.edu/grasping/rect
- (2016) Learning physical intuition of block towers by example. In International Conference on Machine Learning, pp. 430–438.
- (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
- (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Dataset: https://sites.google.com/site/brainrobotdata/home
- (2017) Dex-Net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems (RSS). Dataset: berkeleyautomation.github.io/dex-net/
- (2019) Learning ambidextrous robot grasping policies. Science Robotics 4 (26).
- (2018) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. In Robotics: Science and Systems XIV.
- (2019) 6-DOF GraspNet: variational grasp generation for object manipulation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2901–2910.
- (2019) 6-DOF grasping for target-driven object manipulation in clutter. ICRA 2020, to appear.
- (2018) Evaluating methods for end-user creation of robot task plans. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6086–6092.
- (2014) Real-time grasp detection using convolutional neural networks. CoRR abs/1412.3128.
- (2016) Prioritized experience replay. In ICLR 2016: International Conference on Learning Representations.
- (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30.
- (2019) The best of both modes: separately leveraging RGB and depth for unseen object instance segmentation. CoRL 2019.
- (2018) Neural task programming: learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
- (2019) Form2Fit: learning shape priors for generalizable assembly from disassembly. In Proceedings of the IEEE International Conference on Robotics and Automation.
- (2019) TossingBot: learning to throw arbitrary objects with residual physics. arXiv preprint arXiv:1903.11239.
- (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245.
- (2016) Modular deep Q networks for sim-to-real transfer of visuo-motor policies. In Australasian Conference on Robotics and Automation (ACRA) 2017.
- (2018) A real-time robotic grasp approach with oriented anchor box. arXiv preprint arXiv:1809.03873.
- (2018) Reinforcement and imitation learning for diverse visuomotor skills. In Robotics: Science and Systems XIV, Vol. 14.