Contextual Reinforcement Learning of
Visuo-tactile Multi-fingered Grasping Policies
Using simulation to train robot manipulation policies holds the promise of an almost unlimited amount of training data, generated safely out of harm’s way. One of the key challenges of using simulation, to date, has been to bridge the reality gap, so that policies trained in simulation can be deployed in the real world. We explore the reality gap in the context of learning a contextual policy for multi-fingered robotic grasping. We propose a Grasping Objects Approach for Tactile (GOAT) robotic hands, learning to overcome the reality gap problem. In our approach we use human hand motion demonstration to initialize and reduce the search space for learning. We contextualize our policy with the bounding cuboid dimensions of the object of interest, which allows the policy to work on a more flexible representation than directly using an image or point cloud. Leveraging fingertip touch sensors in the hand allows the policy to overcome the reduction in geometric information introduced by the coarse bounding box, as well as pose estimation uncertainty. We show our learned policy successfully runs on a real robot without any fine tuning, thus bridging the reality gap.
Enabling robots to autonomously grasp objects of varying shape and size with multi-fingered hands is a fundamental challenge that must be met to produce more general manipulation skills such as pick-and-place tasks, human handover, and dexterous tool use. Classical solutions to this problem take a model-based planning and control approach. A typical pipeline estimates the object pose, given either a 3D point cloud or mesh of the object, then plans a set of contact locations and a hand configuration to define the grasp, and finally generates a motion plan to reach and grasp the object. Such systems are sensitive to perception and calibration errors and often require significant computational time to plan and execute. These issues can cause the system to misbehave and fail to grasp the object.
In this work we propose to overcome these constraints by learning a policy to grasp objects of varying geometry and scale with a multi-finger gripper using deep reinforcement learning (RL). A few important challenges arise in formulating the multi-fingered grasping problem as an RL problem. First, how to cope with the relatively high dimension of the multi-fingered hand’s configuration space in order to effectively explore the space of possible grasping policies? Second, how should the learner represent the object to be grasped in a way that can effectively generalize across objects of varying shape, while still being succinct enough to train efficiently? Third, how can we learn such a policy purely in simulation with no need to fine tune the policy for use in the physical world?
In order to efficiently search over the high-dimensional space of grasping policies, we leverage recent advancements in camera-based human hand pose estimation and imitation learning to provide human grasping demonstrations from an RGB camera. We use these grasping demonstrations as a component in our reward function, providing a prior for preferred grasping trajectories to the learner in simulation.
We address the problems of object representation and sim-to-real transfer by proposing a bounding-box based object representation. We extract the location of the 8 vertices of the cuboid enveloping the object to provide the object’s pose, general shape, and size as a context variable to the policy. Using these keypoints explicitly as a context variable and training over a variable set of object shapes enables our policy to adapt to different block-shaped objects upon deployment without the need for further training.
However, this does not enable the robot to robustly compensate for object geometries, such as cylinders or cones, that are not tightly captured by a bounding box. As such, we additionally make use of tactile sensing to provide contact information as part of the robot’s state. This enables the policy to learn that making, and maintaining, contact is necessary for grasping. It also helps bridge the sim-to-real gap: tactile sensors on the physical robot compensate not only for object shape mismatch but also for localization and calibration error from visual sensing. We deploy our final learned policy onto a real-world system where visual input to the policy comes from an RGB pose estimator and the contact information is retrieved from BioTac tactile sensors.
Our approach differs from many recent RL approaches to sim-to-real tasks, which attempt to overcome poor parameterization of the system dynamics or of object and environment appearance by learning policies robust to high variation in visual sensing [34, 35, 29]. We take an alternative approach of abstracting away the uncertain object appearance and geometry into a succinct set of geometric features. To account for the coarse approximation these features induce, we leverage tactile sensors in the robot’s fingertips to observe contacts explicitly as part of our state. This also differs from standard approaches to grasp learning, where richer visual features, either learned [37, 19, 21] or hand-crafted [25, 14], are leveraged to understand the object geometry at a relatively high resolution.
This work makes the following contributions:
We present a system that leverages human demonstrations of grasping, reinforcement learning, and sim-to-real transfer to accomplish a multi-finger grasping task on a real-world system. We demonstrate that our system generalizes to unseen shapes in the real world without any fine tuning.
We introduce a novel approach to fusing visual and tactile information in learned grasp policies, using 3D keypoints as context variables encoding object shape and binary contact signals within our object state. This allows our policy to reason about the object size and orientation implicitly, creating a versatile policy that can adapt locally by leveraging the sensed contact information.
We provide empirical results demonstrating the benefits of our various contributions. We show that our keypoint representation coupled with tactile feedback can successfully grasp objects of varying shape not seen in training. We additionally quantify the benefits of using human hand grasping demonstrations in learning a multi-fingered grasping policy. We show that our learned policy achieves comparable results to a hand-engineered policy on a real-world, physical robot without any fine tuning. We further demonstrate the ability to grasp with varying grasp styles simply by changing the human demonstrations provided during training. We will release our dataset of captured human hand motions used to teach our robot to grasp with style upon publication.
We now present the details of our approach to learning grasping policies for multi-fingered hands. We begin with a brief background of contextual policy search for reinforcement learning. We then give the specifics of how we encode the grasping problem into this contextual policy search framework. Following this we discuss how we learn policies informed from demonstration using RL. We conclude the section by describing how the policy is deployed on the physical robot.
II-A Background: Contextual Policy Search
We formulate the task of multi-finger grasping as a contextual policy search problem. This differs from the classic Markov Decision Process (MDP) in that the agent (robot) observes a context variable at the beginning of the episode, which parameterizes the reward function defined over the state and action spaces. The objective of the contextual policy search problem remains the same as in standard reinforcement learning, namely to find a policy that maximizes the expected accumulated reward, conditioned on the observed context.
The remaining components of the MDP also exist in our problem formulation, specifically the transition function, the initial state distribution, and the discount factor. We additionally make explicit the policy parameters, which we seek to learn through roll-outs of the system.
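As a sketch of the objective described in this subsection, one possible way to write it is given below. The notation is assumed for illustration and is not recovered from the original: $c$ is the context, $s_t$ and $a_t$ are the state and action, $r(s_t, a_t, c)$ is the context-parameterized reward, $\mathcal{T}$ is the transition function, $\rho_0$ is the initial state distribution, $\gamma$ is the discount factor, and $\theta$ are the policy parameters.

```latex
\begin{aligned}
J(\theta \mid c) &= \mathbb{E}\!\left[\, \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t, c) \;\middle|\; c \right], \\
\theta^{*} &= \arg\max_{\theta}\; J(\theta \mid c), \\
\text{with}\quad s_0 &\sim \rho_0, \qquad a_t \sim \pi_{\theta}(\cdot \mid s_t, c), \qquad s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t).
\end{aligned}
```

The only difference from the standard discounted RL objective in this sketch is that both the reward and the policy take the context $c$ as an additional argument.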
II-B Grasping as Contextual Policy Search
We define the context variables for our multi-fingered grasping problem as the keypoints of a bounding box surrounding the object of interest at its pose at the beginning of the episode (see Fig. 1). This defines a low-dimensional feature representation that encodes the object geometry. There are several ways to infer these features at runtime, such as pose estimation of known objects. By providing the object’s pose only at the beginning of the trial, we remove the need to explicitly track the object during execution. We believe this to be an advantage, as stably tracking the object, even when a known model exists, remains challenging because of the inevitable (partial) occlusion of the object by the hand interacting with it. Since the initial estimate may be inaccurate and the object will likely move during execution, we provide binary contact information for each robot fingertip as part of the robot’s state space.
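To make the keypoint context concrete, the following minimal sketch computes the 8 corner points of a cuboid from a centre position, its dimensions, and an in-plane rotation. The function names `cuboid_keypoints` and `flatten`, and the choice to parameterize orientation by a single yaw angle, are assumptions made for illustration; the source does not specify how the corners are computed.

```python
import math


def cuboid_keypoints(center, dims, yaw=0.0):
    """Return the 8 corners of a cuboid as (x, y, z) tuples.

    center: (x, y, z) of the cuboid centre in the robot base frame.
    dims:   (length, width, height) of the cuboid.
    yaw:    assumed in-plane rotation about the vertical axis, in radians.
    The 4 bottom-face corners come first, then the 4 top-face corners.
    """
    cx, cy, cz = center
    half_l, half_w, half_h = (d / 2.0 for d in dims)
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for sz in (-1, 1):
        for sx in (-1, 1):
            for sy in (-1, 1):
                lx, ly = sx * half_l, sy * half_w
                # Rotate the local offset about z, then translate.
                x = cx + cos_y * lx - sin_y * ly
                y = cy + sin_y * lx + cos_y * ly
                corners.append((x, y, cz + sz * half_h))
    return corners


def flatten(points):
    """Flatten a list of 3D points into one context vector."""
    return [value for point in points for value in point]
```

With 8 corners and 3 coordinates each, the flattened context vector in this sketch has 24 entries.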
In simulation we can directly observe contacts using the model of the robot and object. On the physical system we estimate contact using the pressure readings of the BioTac sensors embedded in each fingertip. In addition to localizing the object, we hypothesize that contact information provides an extremely useful signal for learning stable grasps that generalize across different object geometries. The state space includes the Cartesian palm location and orientation, both defined in the robot base frame; the joint positions and velocities of the 16-DOF four-fingered hand; and a contact vector containing binary contact information for the four fingertips. Together these quantities form the state vector. The context variable contains the Cartesian locations of each corner of a cuboid in the robot base frame. We define the robot action space as the desired Cartesian hand pose and the desired joint positions of the fingers. As such, our action space has 22 dimensions.
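The state and action layout described above can be sketched as follows. Several details here are assumptions rather than facts from the source: the palm orientation is taken to be 3-dimensional (consistent with a 22-dimensional action made of a 6-D hand pose plus 16 joint targets), the ordering of the state entries is arbitrary, and `CONTACT_PRESSURE_THRESHOLD` is a hypothetical value for turning a pressure reading into a binary contact.

```python
NUM_JOINTS = 16       # Allegro hand joints (from the text)
NUM_FINGERTIPS = 4    # one binary contact per fingertip (from the text)
CONTACT_PRESSURE_THRESHOLD = 10.0  # hypothetical value and units


def binary_contacts(pressures, threshold=CONTACT_PRESSURE_THRESHOLD):
    """Threshold per-fingertip pressure readings into 0/1 contact flags."""
    return [1.0 if p > threshold else 0.0 for p in pressures]


def build_state(palm_pos, palm_ori, joint_pos, joint_vel, pressures):
    """Concatenate palm pose, joint state, and contacts into one vector."""
    expected = [(palm_pos, 3), (palm_ori, 3), (joint_pos, NUM_JOINTS),
                (joint_vel, NUM_JOINTS), (pressures, NUM_FINGERTIPS)]
    for values, size in expected:
        if len(values) != size:
            raise ValueError(f"expected {size} values, got {len(values)}")
    return (list(palm_pos) + list(palm_ori) + list(joint_pos)
            + list(joint_vel) + binary_contacts(pressures))


def split_action(action):
    """Split a 22-D action into (6-D hand pose target, 16 joint targets)."""
    if len(action) != 6 + NUM_JOINTS:
        raise ValueError("action must have 22 entries")
    return list(action[:6]), list(action[6:])
```

Under these assumptions the state vector has 3 + 3 + 16 + 16 + 4 = 42 entries; a different orientation parameterization (for example a quaternion) would change that count.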
II-C Reward Function
The task of reaching and grasping a wide range of objects with a multi-fingered hand is not trivial and as such we introduce reward terms to overcome several different challenges. We present each reward term in turn below; we define the final reward as the sum of these terms with weights selected such that each component has relatively equal scale.
Hand location with respect to the object. The first reward component encourages moving the palm of the hand close enough to the object to enable contact. Assuming a valid object pose estimate, keypoint locations of the object are computed in the robot base frame. We compute this reward using the average of the 4 keypoint locations on the top surface of the object.
Hand motion. The second reward component serves to focus the policy search on motions that are likely to work, in order to overcome the relatively high-dimensional configuration space of multi-fingered hands (16 DOF for our Allegro hand). To tackle this issue, we use human demonstrations, captured from a hand pose estimator, as useful prior information for policy learning. This, however, introduces another concern, as the kinematic structure of the human hand is different from the robot’s. Since we know the kinematic link lengths of the Allegro hand and of the human hand from which demonstrations are generated, we perform a simple re-scaling of the data to fit the robot hand dimensions. In addition, we only reward the policy when the robot’s fingertip locations track the fingertip locations obtained from the human hand pose estimator. The purpose of the demonstrations is not to provide an accurate trajectory for the fingers to follow, but to reduce the search space of the policy.
Task success. Once the robot grasps the object, we reward the policy if it successfully lifts the object to a target position above its starting location.
Contact. Our reward function also encourages the robot to make fingertip contact with the object. We hypothesize that contact information greatly improves the ability to learn a stable grasping policy across objects of varying size and geometry. For each fingertip we define a binary variable that has value 1 if the fingertip is in contact and 0 otherwise.
The goal of our control policy is to generalize to objects of different geometry. The structure of our reward function, with multiple terms, reflects this goal, e.g., through touch sensing and cuboid keypoints. In our experiments, we found that a binary reward is not feasible for a task in which a multi-fingered robot must reach and grasp an object: the reward is too sparse to learn anything. We assume in our experimental setup that the hand starting location is near the object of interest.
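The four reward terms of this subsection can be sketched as below. The functional forms are assumptions chosen for illustration (negative distances, a clipped lift fraction, a contact count), since the source describes the terms only in words here; the single global scale factor in `rescale_demo` and the unit values in `WEIGHTS` are likewise hypothetical.

```python
import math

# Hypothetical weights; the text only says the weights are chosen so
# that the terms have roughly equal scale.
WEIGHTS = {"location": 1.0, "motion": 1.0, "lift": 1.0, "contact": 1.0}


def top_face_center(keypoints):
    """Average of the 4 keypoints with the largest z (the top face)."""
    top = sorted(keypoints, key=lambda p: p[2], reverse=True)[:4]
    return tuple(sum(p[i] for p in top) / 4.0 for i in range(3))


def location_reward(palm_pos, keypoints):
    """Assumed form: negative distance from palm to top-face centre."""
    return -math.dist(palm_pos, top_face_center(keypoints))


def rescale_demo(human_tips, human_link_len, robot_link_len):
    """Scale human fingertip positions to robot hand dimensions."""
    scale = robot_link_len / human_link_len
    return [tuple(scale * v for v in tip) for tip in human_tips]


def motion_reward(robot_tips, demo_tips):
    """Assumed form: negative mean fingertip tracking error."""
    errors = [math.dist(r, d) for r, d in zip(robot_tips, demo_tips)]
    return -sum(errors) / len(errors)


def lift_reward(object_z, start_z, target_lift):
    """Assumed form: fraction of the target lift achieved, in [0, 1]."""
    progress = (object_z - start_z) / target_lift
    return min(max(progress, 0.0), 1.0)


def contact_reward(contacts):
    """Number of fingertips currently in contact (0/1 flags)."""
    return float(sum(contacts))


def total_reward(terms, weights=WEIGHTS):
    """Weighted sum of the named reward terms."""
    return sum(weights[name] * value for name, value in terms.items())
```

Setting the `"motion"` weight to 0 in such a sketch corresponds to dropping the demonstration term, and every term remains dense, which is the property the text argues a purely binary reward lacks.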
II-D Training Details
We use the proximal policy optimization (PPO) algorithm to learn the policy. We represent the policy as a multi-layer perceptron (MLP) with 2 hidden layers containing 128 neurons each. During training, at the beginning of each rollout we generate a new cuboid object with dimensions uniformly sampled from a pre-specified range. We then estimate the keypoints of the object, add sampled noise to the keypoint locations to simulate sensor noise present in the physical system, and pass them as context to the policy. The keypoint values then remain the same throughout that rollout. Since we wish to deploy the policy learned in simulation on a real robot, we apply domain randomization to account for the discrepancy between the simulator and the physical world. In addition to keypoint location noise, we add uniform noise to the object mass, the friction coefficients between the fingers and object, the PD gains of the robot, and the damping coefficients of the robot joints. The range of each uniform distribution was manually specified based on initial results on the robot. Our method takes about hrs of training time with four threads on an i7 collecting samples across iterations. These numbers are consistent across four different seeds.
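The per-rollout randomization described above might look like the following sketch. Every numeric range is a hypothetical placeholder (the source states only that the ranges were specified manually), and the Gaussian form of the keypoint noise is an assumption, since the text says only that sampled noise is added.

```python
import random

# All ranges below are hypothetical placeholders, not values from the source.
CUBOID_DIM_RANGE = (0.04, 0.20)          # metres, per axis
MASS_RANGE = (0.05, 0.50)                # kg
FRICTION_RANGE = (0.5, 1.5)              # finger-object friction coefficient
PD_GAIN_SCALE_RANGE = (0.8, 1.2)         # multiplier on nominal PD gains
JOINT_DAMPING_SCALE_RANGE = (0.8, 1.2)   # multiplier on nominal damping
KEYPOINT_NOISE_STD = 0.005               # metres, assumed Gaussian noise


def sample_rollout_params(rng):
    """Draw one set of randomized simulation parameters for a rollout."""
    return {
        "cuboid_dims": tuple(rng.uniform(*CUBOID_DIM_RANGE) for _ in range(3)),
        "mass": rng.uniform(*MASS_RANGE),
        "friction": rng.uniform(*FRICTION_RANGE),
        "pd_gain_scale": rng.uniform(*PD_GAIN_SCALE_RANGE),
        "joint_damping_scale": rng.uniform(*JOINT_DAMPING_SCALE_RANGE),
    }


def add_keypoint_noise(keypoints, rng, std=KEYPOINT_NOISE_STD):
    """Perturb each keypoint coordinate once; reuse for the whole rollout."""
    return [tuple(v + rng.gauss(0.0, std) for v in p) for p in keypoints]
```

In this sketch the noisy keypoints would be computed once at the start of a rollout and then held fixed, matching the statement that the keypoint values remain the same throughout the rollout.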
II-E Keypoint Parameter Adaptation for Novel Geometries
A primary goal of our approach is to learn a policy that generalizes to objects of non-cuboid shapes not seen during training. In essence, a new object implies a new context for the policy. While we can use the bounding box of a novel object to extract the keypoints defining the context variables, we find that this does not work well for objects whose shape differs significantly from the bounding box. As such, we propose optimizing over the context variables in order to find values which will enable the pre-trained policy to succeed. Importantly, we remove the restriction that the keypoints define a rectilinear box, allowing them to take any position in 3D.
Given a policy trained in simulation over a uniform distribution of contexts, when presented with a new object we fix the policy network and search over the context variables using CMA-ES. We initialize the keypoints using the object bounding box. We evaluate the objective function by running a rollout in simulation and provide the height reached by the object once lifted as a continuous reward for the optimizer to maximize. Each iteration uses about 5 rollouts of the policy, which means that about 65-70 trajectories on the new object are needed to adapt the policy to it. This whole process takes about 20 min of compute time. We examine the benefit of this adaptation in Section III-B.
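The structure of this search can be sketched with a much simpler optimizer. The code below is not CMA-ES: it is a plain isotropic evolution strategy with no covariance adaptation, used as a stand-in so that the example stays self-contained. The `objective` callable is hypothetical and stands for "run one simulated rollout of the fixed policy with this context and return the lift height"; the default step size, elite count, and iteration count are illustrative choices.

```python
import random


def search_context(objective, init_context, sigma=0.01, population=5,
                   elite=2, iterations=14, seed=0):
    """Search over context variables with the policy held fixed.

    objective:    callable mapping a context vector to a scalar to maximize.
    init_context: starting point, e.g. flattened bounding-box keypoints.
    Returns the best context found and its objective value.
    """
    rng = random.Random(seed)
    mean = list(init_context)
    best_ctx, best_val = list(init_context), objective(init_context)
    for _ in range(iterations):
        # Sample a small population around the current mean.
        candidates = [[m + rng.gauss(0.0, sigma) for m in mean]
                      for _ in range(population)]
        scored = sorted(((objective(c), c) for c in candidates),
                        key=lambda item: item[0], reverse=True)
        # Move the mean to the average of the best candidates.
        elites = [c for _, c in scored[:elite]]
        mean = [sum(vals) / elite for vals in zip(*elites)]
        if scored[0][0] > best_val:
            best_val, best_ctx = scored[0][0], list(scored[0][1])
    return best_ctx, best_val
```

With 5 candidates per iteration and 14 iterations, this sketch uses 70 rollouts plus one for the initial context, which is in line with the rollout budget quoted above. Because each keypoint coordinate is perturbed independently, the candidates are free to leave the set of rectilinear boxes.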
We evaluate our method both in simulation and on the real robot. In these experiments we answer the following overarching questions. First, how important is hand demonstration data to learn an effective policy? Second, how does including contact information change the effectiveness of the grasp? Third, how sensitive is the policy learning to the object feature representation? And fourth, can our policy successfully transfer to a real robot without adaptation?
As such this section is organized as follows: We first discuss the implications of our state representation and reward functions by comparing GOAT to different baselines. Then we quantify how parametrization search over our keypoint representation can improve the learned policy’s performance. In addition to these experiments, we also show that using our method we can grasp objects with 6 different styles and evaluate the effectiveness of the different grasp styles. We conclude this section by showing real-world experiments on the robot.
III-A Comparison Methods
In order to evaluate the proposed method, we compare it to three baselines:
Baseline 1. The policy does not use any contact information; we hypothesize that local contact information is important in adapting to non-cuboid shapes and for identifying stable grasps once the robot hand makes contact with the object.
Baseline 2. We include contact information; however, we do not reward the policy for tracking the human hand demonstrations, i.e., we set the weight in Eq. (2) to 0. This tests the importance of demonstration data for learning in this high-dimensional action space, which, combined with the sparse nature of the reward, makes for a difficult reinforcement learning problem.
Baseline 3. We change the context variable to a single 6-DoF pose vector of the object’s center. This tests our hypothesis that using keypoint information as the context variable provides a coarse representation of the object geometry enabling the policy to adapt to objects of varying shape.
To compare the effectiveness of our method to that of the policies trained using the baseline methods we perform two different tests. First, we generate 100 random objects unseen by the policies during training and test grasps for each object from 5 random poses on the table. We compare the number of successful grasps out of these 500 resulting trials.
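The first test protocol (100 unseen objects, 5 random poses each, 500 trials in total) can be sketched as a simple counting loop. The `run_trial` callable is hypothetical and stands for one simulated grasp attempt returning success or failure; the pose sampling ranges are placeholders.

```python
import math
import random


def evaluate(run_trial, objects, poses_per_object=5, seed=0):
    """Count successful grasps over every object and sampled pose.

    run_trial: hypothetical callable (obj, pose) -> bool.
    objects:   iterable of test objects unseen during training.
    Returns (number of successes, number of trials).
    """
    rng = random.Random(seed)
    successes, trials = 0, 0
    for obj in objects:
        for _ in range(poses_per_object):
            # Placeholder pose: planar offset on the table plus yaw.
            pose = (rng.uniform(-0.1, 0.1),
                    rng.uniform(-0.1, 0.1),
                    rng.uniform(-math.pi, math.pi))
            successes += 1 if run_trial(obj, pose) else 0
            trials += 1
    return successes, trials
```

Using the same seed for every method would give each policy the same set of poses, which keeps the comparison between our method and the baselines like for like.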
Figure 2 illustrates the number of successful grasps achieved by each method on different object types. Each bar represents the average over four different training seeds for a specific category. Object Database refers to the open-source 3D object grasp database, from which we randomly selected 20 objects. Our proposed method clearly outperforms all the baselines on the different object types. Interestingly, the baselines all perform somewhat similarly, suggesting that our method provides the most detailed information for accomplishing this task. We also show the learning curves for the average reward achieved by each method during training in Figure 3 for the cuboid category. The learning curves show the average and variance over four different seeds. It is worth noting that the weighting of the reward function remains the same across all experiments.
III-B Parameter Adaptation Experiments
In the previous experiment with unseen objects, we tested the trained policy with context parameters selected from the object bounding box provided by our simulator. We also ran experiments to investigate the effect of the keypoint adaptation approach presented in Section II-E. Figure 4 shows the improvement in grasp success rate after parameter adaptation for both cuboid and non-cuboid objects. Figure 5 illustrates how the optimization loss decreases during the parameter adaptation process. It takes on average iterations of CMA-ES to identify keypoint inputs that enable the policy to pick up novel objects.
III-C Grasping with Style
To leverage the hand pose data made available by the hand pose estimator, we learn different grasping styles. For the purpose of this experiment we define a grasping style as a simple motion that the robot has to follow, e.g., only using the thumb and the index finger for grasping. Figure 7 illustrates the grasp success rate of each of the different styles. As expected, two-fingered grasps are not as successful as three- or four-fingered grasps. The objects used for this test were a mixture of 50% cuboid and 50% non-cuboid shapes.
III-D Real Robot
The ultimate test for GOAT is whether the learned policy can be deployed onto a real-world robot. We use an Allegro robotic hand with 4 BioTac sensors mounted on a 7-DoF Kuka LBR iiwa 7 R800 arm. We use the pressure sensing on the BioTac, which is quite sensitive, to detect contact. We use DOPE to localize the object and generate its bounding box keypoint locations. We use the 5 objects from the YCB dataset that DOPE can detect: cracker box, meat, mustard, soup, and sugar box. Other methods could be used here to fit a bounding box around the object; for example, we could leverage point cloud sensing to fit a bounding box on points above the work surface, assuming an uncluttered environment. During the experiment, the object was placed randomly within the robot’s workspace five times with a random in-plane orientation between and , where means the object’s axis is aligned with the robot base. For each pose detection we sample a normal distribution with variance of 1 mm or 10 mm to perturb the object location. We consider a grasp successful if the object stays above the work surface for at least 5 seconds.
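The pose perturbation and the success criterion used in these trials can be sketched as follows. Two assumptions are made loudly here: the noise level (0.001 or 0.01, in metres) is passed to the sampler as a standard deviation, although the text describes it as a variance, and object height is assumed to be available as a sequence of samples at a fixed period `dt`. The `clearance` margin is a hypothetical parameter.

```python
import random


def perturb_position(position, noise_level, rng):
    """Add zero-mean Gaussian noise to each coordinate of a detection.

    noise_level is treated as a standard deviation in metres (assumption).
    """
    return tuple(v + rng.gauss(0.0, noise_level) for v in position)


def grasp_succeeded(heights, dt, surface_z=0.0, hold_time=5.0,
                    clearance=0.005):
    """True if the object stays above the surface for hold_time seconds.

    heights:   object height samples, one every dt seconds.
    clearance: hypothetical margin above the surface counted as lifted.
    """
    needed = max(1, round(hold_time / dt))
    run = 0
    for z in heights:
        # Count consecutive samples above the surface; reset on a drop.
        run = run + 1 if z > surface_z + clearance else 0
        if run >= needed:
            return True
    return False
```

Counting consecutive samples rather than accumulating elapsed time avoids floating-point drift when `dt` is not exactly representable.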
We compared our method against a handwritten grasping policy, denoted baseline. Our baseline simply moves to a position 6 cm above the estimated center of the object. Once it reaches this location, the hand begins closing its fingers towards the object. Each finger stops moving when it detects contact with the object. Once all fingers have touched the object the hand exerts more force on the object before lifting it up 7 cm.
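The handwritten baseline described above is naturally a small state machine. The sketch below captures its four phases; the command dictionary format, the `tol` arrival tolerance, and the phase names are assumptions for illustration, while the 6 cm hover height and 7 cm lift come from the text.

```python
import math
from enum import Enum, auto

HOVER_HEIGHT = 0.06  # metres above the estimated object centre (from the text)
LIFT_HEIGHT = 0.07   # metres to lift after squeezing (from the text)


class Phase(Enum):
    APPROACH = auto()
    CLOSE = auto()
    SQUEEZE = auto()
    LIFT = auto()
    DONE = auto()


def baseline_step(phase, palm_pos, object_center, contacts, tol=0.005):
    """Return (next_phase, command) for one control step of the baseline."""
    if phase is Phase.APPROACH:
        target = (object_center[0], object_center[1],
                  object_center[2] + HOVER_HEIGHT)
        arrived = math.dist(palm_pos, target) <= tol
        return (Phase.CLOSE if arrived else Phase.APPROACH,
                {"palm_target": target})
    if phase is Phase.CLOSE:
        if all(contacts):
            # Every fingertip touches the object: tighten the grasp.
            return Phase.SQUEEZE, {"squeeze": True}
        # Only fingers without contact keep closing.
        return Phase.CLOSE, {"close_fingers": [not c for c in contacts]}
    if phase is Phase.SQUEEZE:
        target = (palm_pos[0], palm_pos[1], palm_pos[2] + LIFT_HEIGHT)
        return Phase.LIFT, {"palm_target": target, "squeeze": True}
    return Phase.DONE, {}
```

Like the learned policy, this baseline relies on per-fingertip contact to decide when each finger stops, so the comparison isolates the learned reaching and finger coordination rather than access to touch.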
Table I: Real-robot grasp results for our method and the baseline at two noise levels (noise = 0.001 and noise = 0.01).
Table I shows our results: our method performs similarly to the baseline under different noise levels. The soup is quite a challenging object for performing a top grasp. We were surprised to see our method moving its fingers as if searching for the object and achieving a stable grasp on the cylinder, even though it was never trained on such a physical object. Representative grasps generated by our policy for each object are shown in Figure 6.
IV Related Work
Robotic grasping is normally approached either through analytical, model-based methods or through data-driven methods using either supervised or reinforcement learning. The former focus on constructing grasps that satisfy specific conditions, e.g., gripper configuration, object contact points, force closure, task completion, etc., while modelling the robot universe based on 3D models, partial meshes, and dynamic kinematic models. The latter, learning-based methods, may learn from annotated datasets or from the robot interacting with its environment [19, 40]. These learned grasping behaviors tend to generalize better to unseen objects and situations.
Reinforcement Learning (RL) has been gaining prominence for robotic manipulation in recent years; many of these works have focused on learning grasping, but the majority focus on the simpler 2D gripper problem [11, 33, 39, 40, 17, 28, 3, 8]. Andrychowicz et al. have trained a multi-finger robotic hand policy to repose a cube in-hand to match a desired pose. Similar to our work, they leverage simulation to train a policy to be deployed in the real world; however, they do not focus on grasping, instead assuming the object already rests in the robot’s hand.
The closest previous work to ours, by Osa et al., also learns a grasping policy for different grasping styles using reinforcement learning initialized by human demonstrations. The grasping style is a function of the surface mesh similarity to those seen during training and, as such, their method cannot enforce a specific style a priori.
Another work with similar goals to ours uses supervised learning, coupled with analytical planning, to plan multi-fingered grasps of different styles, i.e., precision and power. They achieve this by explicitly modeling the grasp style as a decision variable in the grasp optimization. Similar to previous robotics work [37, 22, 19, 18, 12], they learn a grasp success predictor from data. Given a grasp configuration, they use the gradients from the predictor to refine the proposed grasp until it is predicted to be successful (has high probability). Once the grasp configuration is found, it is executed by a planner. Our work differs from this as we seek to learn separate grasping policies for each grasp style from a single human hand demonstration without relying on any planning algorithms for grasp execution. Other supervised-learning works have focused on grasping objects using one-shot learning to predict contact points.
Representation plays a very important role for learning in robotic manipulation. Choosing the right one makes it possible to learn downstream tasks. Lee et al. proposed a method that learns an initial representation using unsupervised learning methods. Once the representation is learned, they leverage the multi-sensing description to learn tasks, such as peg-in-hole insertion, using RL methods. Other works have explored using touch sensing to grasp objects under different assumptions, although very little work has been done on learning with this sensor [9, 15, 7, 24, 2]. Manuelli et al. also leverage a keypoint representation to learn an agnostic representation for a class of objects, for which a classical controller is written to accomplish a pick-and-place task. Other works have focused on learning the full 6D pose of known objects for robotic pick and place [36, 38]. Similar to [27, 5], our state representation also includes finger contact information to overcome shape and pose uncertainty; however, they rely on hand-tuned, model-based controllers for execution. We believe our approach to be the first to explore using visual keypoints coupled with tactile feedback in order to learn grasping behaviors with RL.
V Conclusion and Discussion
We have presented a contextual policy search approach to learning policies for grasping unknown objects with multi-fingered hands using a bounding box representation and contact sensing. We validate that our approach can train purely in simulation and be successfully deployed in the real world on a physical robot. We introduce the use of bounding box keypoints as a contextual representation for the reward and, in turn, the policy. We show that coupling this keypoint representation with contact sensing in the policy allows the robot to adapt to previously unseen shapes and overcome uncertainty in object pose estimation arising from noisy visual sensing. For objects whose shape deviates greatly from that of a bounding box (e.g., a cone), we can optimize over the context variables to enable greater grasping performance without needing to retrain our learned policy.
The authors would like to thank Karl Van Wyk for his amazing help in setting up the robotics system. We would also like to thank Nathan Ratliff, Rowland O’Flaherty, Ankur Handa, and Clemens Eppner for their technical help with various challenges.
-  (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177. Cited by: §IV.
-  (2018) More than a feeling: learning to grasp and regrasp using vision and touch. IEEE Robotics and Automation Letters 3 (4), pp. 3300–3307. Cited by: §IV.
-  (2018) Review of deep learning methods in robotic grasp detection. Multimodal Technologies and Interaction 2 (3), pp. 57. Cited by: §IV.
-  (2015) The YCB object and model set. In IEEE Int. Conf. on Advanced Robotics, pp. 510–517. Cited by: §III-D.
-  (2015) An adaptive compliant multi-finger approach-to-grasp strategy for objects with position uncertainties. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 4911–4918. Cited by: §IV.
-  (2015) Synthesis and optimization of force closure grasps via sequential semidefinite programming. In Int. Symp. on Robot. Res., pp. 1–16. Cited by: §I.
-  (2014) Semantic grasping: planning task-specific stable robotic grasps. Autonomous Robots 37 (3), pp. 301–316. Cited by: §IV.
-  (2018) Learning task-oriented grasping for tool manipulation with simulated self-supervision. In Robotics Science and Systems, Cited by: §IV.
-  (2010) Contact-reactive grasping of objects with partial shape information. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1228–1235. Cited by: §IV.
-  (2018) Hand pose estimation via latent 2.5 d heatmap regression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 118–134. Cited by: §I, §II-C.
-  (2018) Residual reinforcement learning for robot control. arXiv preprint arXiv:1812.03201. Cited by: §IV.
-  (2015) Leveraging big data for grasp planning. In Proc. IEEE Int. Conf. Robot. Autom., pp. 4304–4311. Cited by: §III-A, §IV.
-  (2011) Reinforcement learning to adjust robot movements to new situations. In International Joint Conference on Artificial Intelligence, pp. 2650–2655. External Links: Cited by: §II-A.
-  (2016) One-shot learning and generation of dexterous grasps for novel objects. The International Journal of Robotics Research 35 (8), pp. 959–976. Cited by: §I, §IV.
-  (2012) Probabilistic sensor-based grasping. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2019–2026. Cited by: §IV.
-  (2018) Making sense of vision and touch: self-supervised learning of multimodal representations for contact-rich tasks. arXiv preprint arXiv:1810.10191. Cited by: §IV.
-  (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 37 (4-5), pp. 421–436. Cited by: §IV.
-  (2019) Generating grasp poses for a high-dof gripper using neural networks. arXiv preprint arXiv:1903.00425. Cited by: §IV.
-  (2017) Planning multi-fingered grasps as probabilistic inference in a learned deep network. In International Symposium on Robotics Research, External Links: Cited by: §I, §IV, §IV.
-  (2019) Modeling Grasp Type Improves Learning-Based Grasp Planning. IEEE Robotics and Automation Letters. Cited by: §III-D.
-  (2019) Modeling grasp type improves learning-based grasp planning. IEEE Robotics and Automation Letters 4 (2), pp. 784–791. Cited by: §I, §IV.
-  (2016) Dex-net 1.0: a cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1957–1964. Cited by: §IV.
-  (2019) KPAM: keypoint affordances for category-level robotic manipulation. arXiv preprint arXiv:1903.06684. Cited by: §IV.
-  (2015) Category-based task specific grasping. Robotics and Autonomous Systems 70, pp. 25–35. Cited by: §IV.
-  (2018) Hierarchical Reinforcement Learning of Multiple Grasping Strategies with Human Instructions. Advanced Robotics 32 (18), pp. 955–968. External Links: Cited by: §I, §IV.
-  (2018) Deepmimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37 (4), pp. 143. Cited by: §I.
-  (2010) Null-space grasp control: theory and experiments. IEEE Transactions on Robotics 26 (2), pp. 282–295. Cited by: §IV.
-  (2018) Deep reinforcement learning for vision-based robotic grasping: a simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. Cited by: §IV.
-  (2017-07) CAD2RL: real single-image flight without a single real image. pp. . External Links: Cited by: §I.
-  (2012) An overview of 3d object grasp synthesis algorithms. Robotics and Autonomous Systems 60 (3), pp. 326–336. Cited by: §IV.
-  (2017) Proximal policy optimization algorithms. ArXiv abs/1707.06347. Cited by: §II-D.
-  (1998) Reinforcement Learning : An Introduction. MIT Press. Cited by: §II-A.
-  (2018) Learning robotic assembly from cad. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §IV.
-  (2017) Domain randomization for transferring deep neural networks from simulation to the real world. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §I.
-  (2018) Domain randomization and generative models for robotic grasping. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3482–3489. Cited by: §I.
-  (2018) Deep object pose estimation for semantic robotic grasping of household objects. In Conference on Robot Learning (CoRL), External Links: Cited by: §I, §II-B, §III-D, §IV.
-  (2015) Generating multi-fingered robotic grasps via deep learning. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4415–4420. Cited by: §I, §IV.
-  (2019) DenseFusion: 6d object pose estimation by iterative dense fusion. arXiv preprint arXiv:1901.04780. Cited by: §IV.
-  (2019) Sim-to-real transfer for biped locomotion. arXiv preprint arXiv:1903.01390. Cited by: §IV.
-  (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245. Cited by: §IV, §IV.