Efficient Reinforcement Learning for Multi-Step Visual Tasks via Reward Shaping
In order to learn effectively, robots must be able to extract the intangible context by which task progress and mistakes are defined. In the domain of reinforcement learning, much of this information is provided by the reward function. Hence, reward shaping is a necessary part of how we can achieve state-of-the-art results on complex, multi-step tasks. However, comparatively little work has examined how reward shaping should be done so that it captures task context, particularly in scenarios where the task is long-horizon and failure is highly consequential. Our Schedule for Positive Task (SPOT) reward trains our Efficient Visual Task (EVT) model to solve problems that require an understanding of both task context and workspace constraints of multi-step block arrangement tasks. In simulation EVT can completely clear adversarial arrangements of objects by pushing and grasping in 99% of cases vs an 82% baseline in prior work. For random arrangements EVT clears 100% of test cases at 86% action efficiency vs 61% efficiency in prior work. EVT + SPOT is also able to demonstrate context understanding and complete stacks in 74% of trials compared to a baseline of 5% with EVT alone. To our knowledge, this is the first instance of a Reinforcement Learning based algorithm successfully completing such a challenge. Code is available at https://github.com/jhu-lcsr/good_robot.
Multi-step tasks pose a significant challenge for robots, both because of their complexity and because it can be very easy to undo progress. Historically, this challenge required the use of a task planner, which represents the constraints of a task and searches for an efficient solution. Similarly, learning multi-step tasks requires an agent to understand each action within the context of a larger goal, but with limited knowledge of the state space. Whereas a motion planner constrains its search space according to a priori knowledge about the task space, an agent must learn whether each action contributes toward the end goal or reverses previous progress. We explore mechanisms for efficiently propagating this context information via Deep Reinforcement Learning (DRL) within the domain of block arrangement tasks.
We use the term context to denote information necessary to advance progress in a block stacking task (Fig. 1). For example, a learning algorithm has to discover that grasping a block from an existing stack is unproductive, and imprecisely grasping or placing a block entails a risk of toppling the stack. Likewise, an agent must learn about workspace constraints and develop and apply complex manipulation skills such as pushing or reorienting blocks (Fig. 2).
Discovering and effectively applying contextual knowledge is nontrivial, so behavior which demonstrates an understanding of context should be rewarded. Our proposed SPOT reward schedule (Sec. III-B) takes inspiration from a humane and effective approach to training pets sometimes called “Positive Conditioning”. Consider the goal of training a dog “Spot” to ignore an object or event she finds particularly interesting on command. In this practice, Spot is rewarded with treats whenever partial compliance with the desired end behavior is shown, and simply removed from harmful or regressive situations with zero treats (reward). One way to achieve this is to start with multiple treats in hand, place one treat in view of Spot, and if she eagerly jumps at the treat (a negative action) the human snatches and hides the treat immediately for zero reward on that action. With repetition, Spot will eventually hesitate, and so she is immediately praised with “Good Spot!” and gets a treat separate from the one she should ignore. This approach can be expanded to new situations and behaviors, plus it encourages exploration and rapid improvement once an initial partial success is achieved. As we describe in more detail below, our SPOT reward is likewise designed to provide neither reward nor punishment for actions which reverse progress.
No reward will be effective if the agent is unable to learn the task at hand in a reasonable amount of time. Thus, learning system design (in our case neural network design) goes hand in hand with reward design. In this work, we introduce a novel Efficient Visual Task (EVT) deep convolutional architecture for perception-based manipulation tasks. Furthermore, we demonstrate a novel Schedule for Positive Task (SPOT) reward enabling new capabilities for multi-step robotic tasks with Deep Reinforcement Learning.
In summary, our contributions in this article are: (1) EVT, an efficient and accurate network model for visual tasks; (2) Reinforcement Learning of Multi-step robotic tasks combining model-based low level control, model-free learning of high level goals, and a progressive reward schedule; and (3) The Schedule for Positive Task (SPOT) reward for long-horizon robotic tasks. Combining the above enables DRL to train a network on “hard” multi-step tasks in a reasonable number of iterations.
Our grasping and placing Action Efficiency Error (Sec. IV-A) is 47% of the error found in previous work, while utilizing 49% of the computational resources. More importantly, we are able to complete long term multi-step tasks which are, to our knowledge, not viable with existing baseline algorithms.
II Related Work
The advent of deep learning has introduced novel approaches to robotic manipulation tasks like pushing and grasping. Within this space, our work confronts fundamental problems that arise in multi-step tasks like stacking or arranging blocks. We review these areas here in brief.
Deep Neural Networks (DNNs) in particular have enabled the use of raw images in robotic manipulation. In some approaches, the DNN’s output directly corresponds to motor commands, e.g. [14, 15]. Higher level methods, on the other hand, assume a simple model for robotic control and focus on bounding box or pose detection for downstream grasp planning [17, 25, 20, 5, 11, 10, 12, 16]. Increasingly, these methods benefit from the depth information provided by RGB-D sensors [23, 20, 16], which capture physical information about the workspace. However, the agent must still develop physical intuition, which recent work attempts in a more targeted setting. For instance, [13, 8] focus on block stacking by classifying simulated stacks as stable or likely to fall. Of these, the ShapeStacks dataset [8] includes a larger variety of objects such as cylinders and spheres, as opposed to blocks alone. Similarly, [7, 3] develop physical intuition by predicting push action outcomes. Our work diverges from these approaches by developing visual understanding and physical intuition simultaneously, in concert with understanding progress in multi-step tasks.
When paired with DNNs, reinforcement learning has proven effective at increasingly complex tasks in robotic manipulation. An early approach in this space [14] directly coordinates RGB vision with servo motor control, learning tasks like unscrewing a bottle or using a hammer. Other methods focus on transferring visuo-motor skills from simulated to real robots [24, 26]. Our work directs a low-level controller to perform actions rather than regressing torque vectors directly, following [23, 22] by learning a pixel-wise success likelihood map.
Notably, VPG is state of the art for RL-based table clearing tasks which can be trained within hours on a single robot from images. It is frequently able to complete adversarial scenarios like those in Fig. 4, by first pushing a tightly packed group and then grasping the now-separated objects. We utilize the VPG V-REP simulation and models as our baseline for comparison. VPG assumes an instantaneous reward delivery is sufficient to complete the task at hand. By contrast, we tackle multi-step tasks with sparse rewards which cannot be represented by VPG.
Multi-step tasks with sparse rewards present a challenge in reinforcement learning generally because successes are less likely to be discovered through random exploration. This suggests demonstration is an effective method for guiding exploration [21, 2]. Within robotic manipulation, one approach separates a multi-step task into many modular sub-tasks comprising a sketch [1], while another separates the learning architecture into robot- and task-specific modules [4]. Our SPOT Reward, meanwhile, combines a novel reward schedule with a frequent reset policy to make early successes both more likely and more instructive.
Finally, neural architecture search forms the basis for our hyperparameter choices [19, 6]. Neural networks are imperfect arbitrary function approximators, so a better choice of algorithm is an effective approach to improving deep learning based robotic manipulation algorithms, as we have detailed in past work.
We formulate the problem of visual picking, pushing, and placing to complete a structure as a Partially Observable Markov Decision Process (POMDP) $(S, O, A, T, R)$, with state space $S$, observation space $O$, action space $A$, transition function $T$, and reward function $R$. At every time step $t$, the robot chooses an action $a_t \in A$ to take according to its policy $\pi$, which results in a state transition from $s_t$ to $s_{t+1}$.
As in VPG [23], our goal is to learn a deterministic policy $\pi$ via Q-learning, which chooses the action $a_t$ at every time step $t$ such that $a_t = \arg\max_{a} Q(s_t, a)$. In our case, as in past work, we make the simplifying assumption that the state $s_t$ can be identified from a single observation $o_t$, in which case we can instead frame this as an MDP over observations $o_t$. Formally, we learn $Q$ by iteratively minimizing the temporal difference error of $Q(s_t, a_t)$ to a target $y_t$, where:
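The target equation itself is not reproduced in this text. As a reference point only, the standard one-step Q-learning target, which a temporal difference update of this kind conventionally regresses toward, can be written as below. The symbols $y_t$ (target), $\gamma$ (discount factor), and $a'$ (candidate next action) are notation assumed here, and the exact arguments of the reward function in the original may differ.

```latex
y_t = R(s_{t+1}, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')
```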
In our case, each observation $o_t$ consists of RGB-D heightmap images, as shown in Fig. 3. Images captured by a fixed camera are first converted into a point cloud and then projected so that the $z$ axis is aligned with the direction of gravity. As per [23], our heightmaps cover a 0.448 table space. Images have a resolution of , meaning that each pixel represents roughly .
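The relationship between heightmap pixels and workspace coordinates can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the 0.448 figure is read as the side length in metres of a square workspace, the 224 x 224 pixel resolution is taken from the VPG setup rather than from this text, and the function names and the workspace origin are hypothetical.

```python
# Assumption: 0.448 is the side length, in metres, of a square workspace.
WORKSPACE_SIDE_M = 0.448
# Assumption: 224 x 224 pixels, as in the VPG setup; not stated in this text.
HEIGHTMAP_SIDE_PX = 224


def pixel_size_m(side_m=WORKSPACE_SIDE_M, side_px=HEIGHTMAP_SIDE_PX):
    """Edge length, in metres, of the table area covered by one pixel."""
    return side_m / side_px


def pixel_to_workspace(row, col, origin_xy=(0.0, 0.0)):
    """Map a heightmap pixel (row, col) to the (x, y) of its centre in metres.

    origin_xy is the workspace corner that corresponds to pixel (0, 0);
    it is a hypothetical parameter used only for this illustration.
    """
    size = pixel_size_m()
    x = origin_xy[0] + (col + 0.5) * size
    y = origin_xy[1] + (row + 0.5) * size
    return x, y
```

Under these assumptions each pixel covers a square of about 2 mm on a side, so a planar action chosen at a pixel can be placed in the workspace to within roughly that precision.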
Our actions are represented as a 4-tuple $a_t = (\phi, x, y, \theta)$. We make two simplifying assumptions in order to handle pick and place tasks for assembly. First, we divide actions into a set of high-level motion primitives $\Phi$, where $\phi \in \Phi$. $x$ and $y$ represent 2D planar coordinates that parameterize this action primitive, and $\theta$ is an angle at which to position the gripper. Each $\phi$ also defines the appropriate gripper behavior. In practice we use discrete values for $\theta$ by passing multiple discrete rotations of the input heightmap into $Q$, again following VPG [23].
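Choosing an action from pixel-wise Q-value maps can be sketched as a greedy search over primitives, rotations, and pixels. The data layout below (a dictionary of nested lists), the function name `select_action`, and the even spacing of rotation angles over a full turn are assumptions made for illustration; the source only states that discrete rotations of the heightmap are passed through the network.

```python
import math


def select_action(q_maps):
    """Return the greedy action (primitive, x, y, theta).

    q_maps maps a primitive name to a list of 2D Q-value maps, one per
    discrete rotation of the input heightmap: q_maps[name][k][row][col].
    """
    best = None
    for primitive, rotations in q_maps.items():
        num_rotations = len(rotations)
        for k, q_map in enumerate(rotations):
            for row, values in enumerate(q_map):
                for col, value in enumerate(values):
                    if best is None or value > best[0]:
                        # Assumption: rotations evenly cover a full turn.
                        theta = 2.0 * math.pi * k / num_rotations
                        best = (value, primitive, col, row, theta)
    if best is None:
        raise ValueError("q_maps must contain at least one Q value")
    _, primitive, x, y, theta = best
    return primitive, x, y, theta
```

For example, with two rotations per primitive, a maximum found in the second rotation of the grasp map yields a grasp at that pixel with the gripper angle set to half a turn.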
III-A Network Architecture
Fig. 3 shows our overall Efficient Visual Task (EVT) solution, including choice of actions, how the robot acts on the world, and the model architecture. EVT utilizes 2 EfficientNet-B0 models, one for grasping and placing, with weights shared between color and depth data. We modify the EfficientNet-B0 [19] pretrained on ImageNet to an FCN [18] by loading the final stride 2 convolution as a stride 1 convolution with a dilation rate of 2. This dilated convolution makes finer-grained action choices possible by doubling the effective output resolution of the network.
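The effect of swapping a stride 2 convolution for a stride 1 convolution with dilation 2 can be checked with the standard convolution output-size formula. The kernel size, padding, and input size used in the example are illustrative assumptions, not values from the EVT architecture.

```python
def conv_output_size(input_size, kernel_size, stride, dilation, padding):
    """Spatial output size of a convolution along one dimension."""
    # A dilated kernel spans dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (input_size + 2 * padding - effective_kernel) // stride + 1


# Illustrative values only: a 3-wide kernel applied to a 14-wide feature map.
strided = conv_output_size(14, kernel_size=3, stride=2, dilation=1, padding=1)
dilated = conv_output_size(14, kernel_size=3, stride=1, dilation=2, padding=2)
```

Here the strided layer halves the feature map while the dilated layer keeps it at full size, which is the doubling of output resolution described above. The dilated kernel still covers the same spacing of input positions that the pretrained weights would have covered in the following layers of a strided network.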
The final dense grasp, push, and place blocks consist of a batch normalization and ReLU before each of two 1x1 convolutions with 2560 channels, i.e. [bn, relu, conv1x1, bn, relu, conv1x1], where a 1x1 convolution is equivalent to a dense layer at each pixel. These parameters are based on the final dense block structure optimized for accuracy via HyperTree Architecture Search [9] in our prior work. Efficiency was not considered in the HyperTree metric; as a result, this pixel-wise dense block accounts for over 50% of the computation in EVT, which makes it a good target for future efficiency gains. The push and grasp EVT architecture consumes 46B FLOPS and holds 24M parameters. The added dense block for placement actions brings this to 57B FLOPS and 30M parameters.
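The statement that a 1x1 convolution is equivalent to a dense layer at each pixel can be illustrated in a few lines of plain Python. This is a toy sketch with hypothetical helper names and tiny channel counts, not the 2560-channel block used in EVT, and it omits the batch normalization and ReLU steps.

```python
def dense(vector, weights, bias):
    """Fully connected layer: one output per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def conv1x1(feature_map, weights, bias):
    """1x1 convolution over a channels-first map feature_map[c][row][col]."""
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    output = []
    for out_channel, kernel in enumerate(weights):
        plane = [[bias[out_channel]] * width for _ in range(height)]
        for in_channel, weight in enumerate(kernel):
            for row in range(height):
                for col in range(width):
                    plane[row][col] += weight * feature_map[in_channel][row][col]
        output.append(plane)
    return output


def pixel_vector(feature_map, row, col):
    """Collect the channel values found at one pixel."""
    return [channel[row][col] for channel in feature_map]
```

Applying `dense` to the channel vector at any pixel of the input gives the same numbers as reading that pixel out of the `conv1x1` result, because the same weight matrix is shared across all pixel locations.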
In our experiments we compare with VPG [23], which is the current state of the art DRL algorithm for clearing objects from a surface. It incorporates 4 separate DenseNet-121 models for grasp RGB, grasp depth, push RGB, and push depth, with a total of 94B floating point multiply-add operations (FLOPS). A push, grasp, and place configuration of VPG has 141.3B FLOPS and 48M parameters, which is 147% more FLOPS than EVT.
III-B SPOT Reward
We focus on learning neural network models in a manner which can solve multi-stage tasks with strong contextual and temporal dependencies between stages. As such, we assume there is a well-defined notion of progress throughout a given task. To this end, consider a set of normalized rewards with range . We generalize the baseline fixed reward definition of VPG [23] with the Exponential Reward Schedule (ExpRS):
For example, VPG’s fixed rewards of 0.5 and 1 can be represented with , is simply a reasonable choice of exponential base, is the current action iteration in a given trial. For example, a successful push gets score when we ignore future discounted rewards. Using these parameters we also chose an ExpRS place reward of 2 for multi-step tasks, and the variables we have defined here are a starting point for our SPOT reward below.
Our Schedule for Positive Task (SPOT) reward has two components: a linearly increasing sub-task “partial compliance” reward is delivered for actions which make progress on the task and a reward of 0 is delivered for actions which result in a reversal of progress. Formally, each reward is computed from different sub-tasks with a number associated with the current active sub-task . Examples of possible values for include one of [grasp=1, push=2, place=3] at a specific action in the sequence of actions during the trial, so varies depending on the action taken at a given time , and is . A current task progress depth indicates linear progress through a task, such as stack height. The range of possible reward values is defined between [,]. First, we wish to expand the skill set, and each mastered skill provides exponential increases in overall capability. For this reason we ensure rewards grow faster as more parts of a curriculum are mastered in the Positive Reward (PR) Component:
Second, as in our story of Spot the dog (Sec. I), we also wish to minimize disincentive for exploration without rewarding mistakes. This second component is called Situation Removal (SR), and is applied separately at each action time step:
Here is from eq. 2 and subtask rewards are given if the “partial compliance” subtask is successfully completed, and 0 otherwise. For stacking tasks we choose and the actions for which subtask rewards are delivered include [none, successful scene change, grasp success, place success] with values . Task depth is the stack height with possible values of . In addition to the instantaneous version of SPOT above, we also define a recursive SPOT Trial Reward for use during experience replay of previous full trials:
These values are recursively rolled out from the final action to the first action of a trial. The effect of this trial reward is that future rewards only propagate across time steps where subtasks are completed successfully.
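The recursive rollout described above can be sketched as follows. This is one reading of the prose, not the authors' equation, which is not reproduced in this text: the function name, the discount value of 0.5, and the exact recursion (instantaneous reward plus discounted future trial reward, reset to zero whenever the instantaneous reward is zero) are assumptions.

```python
def spot_trial_rewards(instant_rewards, discount=0.5):
    """Roll trial rewards back from the final action to the first.

    instant_rewards holds the instantaneous reward of each action in one
    trial, in execution order. A zero entry marks a failed sub-task or a
    reversal of progress, and it blocks later rewards from propagating
    back past that time step.
    """
    trial_rewards = [0.0] * len(instant_rewards)
    future = 0.0  # trial reward of the following time step
    for t in range(len(instant_rewards) - 1, -1, -1):
        if instant_rewards[t] == 0:
            trial_rewards[t] = 0.0
        else:
            trial_rewards[t] = instant_rewards[t] + discount * future
        future = trial_rewards[t]
    return trial_rewards
```

With instantaneous rewards `[1, 2, 0, 1, 3]`, the zero at the third action cuts the chain: the first two actions receive no credit from the later successes, while the fourth action still receives discounted credit from the fifth.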
|Task|Model|Reward|Cases 100% Complete|Completion|Grasp Success|FLOPs (B)|Params (M)|Action Efficiency|
|Clear 10 Toys|VPG|Exp (eq. 1)|100%|100%|68%|94|32|61%|
|Clear 10 Toys|EVT|Exp (eq. 1)|100%|100%|87%|46|24|82%|
|Clear 10 Toys|EVT|SPOT (eq. 4)|100%|100%|87%|46|24|86%|
|Clear Toys Adversarial|VPG|Exp (eq. 1)|5/11|84%|77%|94|32|60%|
|Clear Toys Adversarial|EVT|Exp (eq. 1)|10/11|99%|62%|46|24|51%|
|Stack of 4 Cubes|EVT|Exp (eq. 1)|5%|84%|45%|57|30|4%|
|Stack of 4 Cubes|EVT|SPOT (eq. 3)|74%|93%|83%|57|30|63%|
|Row of 4 Cubes|EVT|SPOT (eq. 4)|92%|68%|61%|57|30|44%|
We conducted simulated experiments in the two baseline scenarios provided by VPG [23] with the same simulator and settings, as well as in two multi-step tasks of our own design. We present the results of these experiments in Table I, with descriptions and analysis below.
IV-A Evaluation Metrics
We evaluate our algorithms in test cases with new random seeds and in accordance with the metrics found in VPG. These include the percentage of successful grasps, placement action efficiency, and task completion rate. The completion rate is defined as the percentage of trials where the policy is able to successfully complete a task before the grasp or push action fails 10 consecutive times. A push is successful when more than 300 pixels have changed in a scene. A successful grasp is counted when the gripper is in a partially open state after executing the open loop action, indicating an object is present in the gripper; a fully closed gripper indicates grasp failure. A successful place for stacking is evaluated more specifically by the z height of object origins when object poses are known. Alternatively, it can be awarded when the highest z height in the scene has increased by a minimum threshold. Action Efficiency is ideally 100% and is calculated as the Ideal Action Count divided by the Actual Action Count. The Ideal Action Count is 1 action per object for grasping tasks, and for tasks which involve placement it is 2 actions per object, where one object is assumed to remain stationary. This means 6 total actions for a stack of height 4, since only 3 objects must move.
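The metric definitions above can be written down as a short sketch. The function names are hypothetical, and the efficiency ratio (ideal action count divided by actual action count) is an assumption about the formula, which is incomplete in this text; the thresholds of 300 changed pixels and 10 consecutive failures come from the description.

```python
def ideal_action_count(num_objects, involves_placement):
    """Fewest actions needed: one grasp per object, or a grasp and a place
    for every object except the one assumed to stay in position."""
    if involves_placement:
        return 2 * (num_objects - 1)
    return num_objects


def action_efficiency(ideal_count, actual_count):
    """Ratio of ideal to actual actions; 1.0 corresponds to 100%."""
    if actual_count <= 0:
        raise ValueError("actual_count must be positive")
    return ideal_count / actual_count


def push_succeeded(changed_pixels, threshold=300):
    """A push counts as successful when more than `threshold` pixels change."""
    return changed_pixels > threshold


def hit_failure_limit(action_successes, limit=10):
    """True once `limit` consecutive actions have failed, ending the trial."""
    consecutive = 0
    for succeeded in action_successes:
        consecutive = 0 if succeeded else consecutive + 1
        if consecutive >= limit:
            return True
    return False
```

For a stack of 4 cubes, `ideal_action_count(4, True)` gives the 6 actions quoted above, so a trial that finishes the stack in 12 actions would score an efficiency of 0.5 under this reading.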
IV-B Baseline Scenarios
Clear 10 Toys: We establish a baseline via the primary simulated experiment found in VPG [23], where 10 toys with varied shapes must be grasped to clear the robot workspace. EVT reduces the Action Efficiency Error (1 - Action Efficiency) from 39% with VPG to 14%. We validate on 100 trials of 10 novel random object positions, where .
Table I (top) shows the results. As this is a fairly straightforward task, all methods were eventually able to complete it in the basic case, but we see that our proposed EVT model has the highest grasp success rate. In this case, we can also see SPOT does not make a meaningful difference, which makes intuitive sense, as the task structure is very simple.
Clear Toys Adversarial: Our second, more challenging baseline scenario (Fig. 5) contains 11 adversarial arrangements from prior work [23] where toys are placed in tightly packed configurations. We use the pretrained weights from the “Clear 10 Toys” task in scenarios the algorithm has never previously seen. The evaluation algorithm defined by VPG is designed such that when no successful grasp is made for 10 consecutive actions the task is considered incomplete. We validate on 10 trials for each of the 11 challenge arrangements, and our model was able to clear the scene in 109 out of 110 trials. Table I (bottom) details the results: EVT performed substantially better. In EVT’s lone failure case the model had successfully separated the tightly packed blocks on the 10th and final action without a successful grasp. It is reasonable to expect it would have finished clearing the scene in the next few actions had it not hit the incomplete task limit.
Curiously, while the rate of Tasks Completed rises, Action Efficiency is reduced in these challenging scenarios. Subjectively, this is due to the higher priority placed on grasping when compared to pushing as EVT attempts grasps in 89% of actions in these scenarios on average. In many cases the algorithm attempts a push only after several failed attempts at grasping, which finally frees up the blocks to complete the task, while in other cases it can separate blocks using grasps alone.
IV-C Multi-Step Task Scenarios
We attempted to evaluate a direct extension of the VPG algorithm in which we simply add new DenseNet-121 place action models for RGB and depth, but the architecture exceeds the memory limits of our RTX 2080 Ti GPUs (and of the Titan X GPUs used by VPG). This illustrates the limited scalability of VPG to new tasks when compared to our EVT architecture.
Stack 4 Cubes: Our primary test task is to stack 4 cubes randomly placed within the scene. We ensure that workspace constraints are strictly observed: any action that topples a partial stack the robot has already assembled returns a reward of 0 and immediately ends the trial with the failure condition. This strict progress evaluation criterion ensures the scores indicate an understanding of the context surrounding the stack. A block can also occasionally tumble out of the workspace after a missed place action, which leaves no opportunity for recovery.
EVT is evaluated under the basic exponential reward schedule (eq. 1) and our SPOT Reward given in eq. 2, which accounts for task progress. Table I shows the results. As evidenced by the large difference between SPOT and the baseline reward schedule, the SPOT Reward proves essential to task completion. EVT trained with SPOT succeeds 74% of the time, versus only 5% without it. This is because, with the exponential reward schedule, it is impossible to differentiate between placing one block on another single block and placing it on a stack of height 2.
Row of 4 Cubes: Our third test task evaluates the ability of the algorithm to generalize across tasks. Curiously, while making a row of 4 blocks appears the same as stacking, it is in fact much more difficult to train to complete efficiently. In particular, whereas with stacking optimal placement occurs on top of a strong visual feature (another block), the arrangement of blocks in rows depends on non-local visual features, i.e. the rest of the row. Additionally, every block in each row is available for grasping, which may reverse progress, as opposed to stacks where only the top block is readily available. This requires significant understanding of context, as we have described it, to accomplish. Table I shows significant progress on this challenging environment, with EVT succeeding 92% of the time. The higher overall task completion rate for rows when compared to stacks is in part due to reduced risk of a block tumbling out of the robot workspace.
In spite of its obvious importance to results in reinforcement learning, reward shaping is one of the most under-explored areas in deep learning research for robotics. We have demonstrated an effective approach for training long-horizon tasks which present a high risk for reversing progress: the SPOT reward. To our knowledge, this is the first instance of reinforcement learning applied toward such a challenge. First, our EVT neural network model far exceeds existing methods’ computational efficiency for manipulation tasks while simultaneously providing a 20% increase in action efficiency and a 15% higher perfect completion rate for adversarial pushing and grasping scenarios. Our results show the continued importance of neural network architecture design choices for Robotics and Reinforcement Learning algorithms. Second, our SPOT Reward quantifies an agent’s progress within multi-step tasks while also providing zero-reward guidance, which we found necessary to achieve a 74% completion rate on a block stacking task and a 92% completion rate in a row creation task. Our work does assume some mechanism for intermediate rewards. Related methods which could help relax this assumption include inverse reinforcement learning of reward signals necessary to complete tasks, learning from demonstration, and metalearning. We expect that recent reinforcement learning algorithms beyond a Q function could also improve the efficiency of our algorithm. Nonetheless, we believe these principles are an effective approach for efficient learning on complex multi-step problems.
Finally, we note that our simulation executes in real time to ensure the viability of task transfer to a real robot, as was done in the baseline VPG [23] experiments. It is our hope to demonstrate similar results on a physical testbed in the near future.
This material is based upon work supported by the NSF NRI Grant Award #1637949.
[1] (2016-11) Modular Multitask Reinforcement Learning with Policy Sketches. ArXiv e-prints.
[2] (2018) Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pp. 2935–2945.
[3] (2016) SE3-nets: learning rigid body motion using deep neural networks. arXiv preprint arXiv:1606.02378. Note: https://arxiv.org/abs/1606.02378
[4] (2017) Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2169–2176.
[5] (2010) Model globally, match locally: efficient and robust 3d object recognition. In CVPR, Vol. 1, pp. 5.
[6] (2019) Neural architecture search: a survey. Journal of Machine Learning Research 20 (55), pp. 1–21.
[7] (2016-05) Unsupervised Learning for Physical Interaction through Video Prediction. ArXiv e-prints.
[8] (2018) ShapeStacks: learning vision-based physical intuition for generalised object stacking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 702–717.
[9] (2019) The costar block stacking dataset: learning with workspace constraints. Intelligent Robots and Systems (IROS), 2019 IEEE International Conference on.
[10] (2017) End-to-end learning of semantic grasping. In Conference on Robot Learning, pp. 119–132.
[11] (2017-09) Robotic grasp detection using deep convolutional neural networks. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[12] (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Note: Dataset: http://pr.cs.cornell.edu/grasping/rect
[13] (2016) Learning physical intuition of block towers by example. International Conference on Machine Learning, pp. 430–438.
[14] (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
[15] (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Note: Dataset: https://sites.google.com/site/brainrobotdata/home
[16] (2018-06) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. Robotics: Science and Systems XIV.
[17] (2014) Real-time grasp detection using convolutional neural networks. CoRR abs/1412.3128.
[18] (2017) Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 640–651.
[19] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
[20] (2018) PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS), Vol. 14.
[21] (2018) Neural task programming: learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
[22] (2019) TossingBot: learning to throw arbitrary objects with residual physics. arXiv preprint arXiv:1903.11239.
[23] (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245.
[24] (2016) Modular deep q networks for sim-to-real transfer of visuo-motor policies. Australasian Conference on Robotics and Automation (ACRA) 2017.
[25] (2018) A real-time robotic grasp approach with oriented anchor box. arXiv preprint arXiv:1809.03873.
[26] (2018) Reinforcement and imitation learning for diverse visuomotor skills. In Robotics: Science and Systems XIV, Vol. 14.