Camera-to-Robot Pose Estimation from a Single Image
We present an approach for estimating the pose of a camera with respect to a robot from a single image. Our method uses a deep neural network to process an RGB image from the camera to detect 2D keypoints on the robot. The network is trained entirely on simulated data using domain randomization. Perspective-n-Point (PnP) is then used to recover the camera extrinsics, assuming that the joint configuration of the robot manipulator is known. Unlike classic hand-eye calibration systems, our method does not require an off-line calibration step but rather is capable of computing the camera extrinsics from a single frame, thus opening the possibility of on-line calibration. We show experimental results for three different camera sensors, demonstrating that our approach is able to achieve accuracy with a single frame that is better than that of classic off-line hand-eye calibration using multiple frames. With additional frames, accuracy improves even further. Code, datasets, and pretrained models for three widely-used robot manipulators will be made available.
Determining the pose of an externally mounted camera is a fundamental problem for robot manipulation. The resulting pose is necessary to transform measurements made in camera space to the robot’s task space. This transformation is essential for the robot to operate robustly in unstructured, dynamic environments, performing tasks such as object grasping and manipulation, human-robot interaction, and collision detection and avoidance.
The classic approach to calibrating an externally mounted camera is to fix a fiducial marker (e.g., ARTag [1] or AprilTag [2]) to the end effector, collect several images, then perform hand-eye calibration by solving AX = XB, where A and B contain the robot and camera transformations, respectively, and X is the unknown camera-to-robot transformation [3, 4]. This approach is still widely used due to its generality, flexibility, and the availability of open-source implementations in ROS. However, it requires the somewhat cumbersome procedure of physically modifying the end effector, moving the robot to multiple joint configurations to collect a set of images, running an off-line calibration procedure, and removing the fiducial marker. Such an approach is not amenable to online calibration, because if the camera moves with respect to the robot, the entire calibration procedure must be repeated from scratch.
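For reference, the hand-eye constraint can be written out in homogeneous-transform form. This is a sketch under the usual convention; the index i, the count m, and the symbols R and t for the rotation and translation parts are notation introduced here for illustration.

```latex
A_i X = X B_i, \qquad i = 1, \dots, m,
\qquad
A_i = \begin{bmatrix} R_{A_i} & t_{A_i} \\ 0 & 1 \end{bmatrix}, \quad
B_i = \begin{bmatrix} R_{B_i} & t_{B_i} \\ 0 & 1 \end{bmatrix}, \quad
X   = \begin{bmatrix} R_X & t_X \\ 0 & 1 \end{bmatrix}

% Multiplying out both sides separates the rotation and translation parts:
R_{A_i} R_X = R_X R_{B_i},
\qquad
R_{A_i} t_X + t_{A_i} = R_X t_{B_i} + t_X
```

In the standard formulation each pair (A_i, B_i) describes a relative motion between two poses, so a single image yields no constraint on X. This is one reason the classic procedure needs several robot configurations.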
A more recent alternative is to avoid directly computing the camera-to-robot transform altogether, and instead to rely on an implicit mapping that is learned for the task at hand. For example, Tobin et al. [5] use deep learning to map RGB images to world coordinates on a table, assuming that the table at test time has the same dimensions as the one used during training. Levine et al. [6] learn hand-eye coordination for grasping a door handle, using a large-scale setup of real robots for collecting training data. In these approaches the learned mapping is implicit and task-specific, thus preventing the mapping from being applied to a new task.
We believe there is a need for a general-purpose tool that performs online camera-to-robot calibration from a single image. With such a tool, a researcher could set up a camera (e.g., on a tripod), and then immediately use object detections or measurements from image space for real-world robot control in a task-independent manner, without a separate offline calibration step. Moreover, if the camera subsequently moves for some reason (e.g., the tripod is bumped accidentally), there would be no need to redo the calibration step, because the online calibration process would automatically handle such disturbances.
In this paper we take a step in this direction by presenting a system for solving camera pose estimation from a single image. We refer to our framework as DREAM (for Deep Robot-to-camera Extrinsics for Articulated Manipulators). We train a robot-specific deep neural network to estimate prespecified keypoints in an RGB image. Combined with camera intrinsics and the known robot joint configuration, the camera extrinsics are then estimated using Perspective-n-Point (PnP) [7]. The network is trained entirely on synthetic images, relying on domain randomization [5] to bridge the reality gap. To generate these images, we augmented our open-source tool, NDDS [8], to allow for scripting robotic joint controls and to export metadata about specific 3D locations on a 3D mesh. We release pretrained models for three popular robotic manipulators: Franka Emika’s Panda, Kuka’s LBR iiwa 7 R800, and Rethink’s Baxter.
This paper makes the following contributions:
- We demonstrate the feasibility of computing the camera-to-robot transformation from a single image without fiducial markers using a deep neural network trained only on synthetic data.
- We show that the resulting accuracy with a single real image frame is better than that of classic hand-eye calibration using multiple frames and fiducial markers. Accuracy further improves with multiple image frames.
- We trained the network on three different robot manipulators. Quantitative and qualitative results are shown for these robots using images taken by a variety of cameras.
- Code, datasets, and pretrained models will be made available to the community upon acceptance of the paper.
II Previous Work
Object pose estimation. The problem of object pose estimation is vibrant within the robotics and computer vision communities [9, 10, 11, 12, 13, 14, 15, 16, 17]. Recent leading methods rely on an approach that is similar to the one proposed here: a network is trained to predict object keypoints in the 2D image, followed by PnP [7] to estimate the pose of the object in the camera coordinate frame [9, 15, 18, 14, 17, 19]. Indeed, our approach is inspired by these methods. We follow the approach of Peng et al. [15], who showed that it is more effective to regress keypoints that are on the object than to regress vertices of an enveloping cuboid. Other methods have regressed directly to the pose [13, 16], but these methods bake the camera intrinsics into the learned weights, although a different set of intrinsics can be applied via geometric postprocessing. In a related strand, researchers have used keypoint detection for human pose estimation [20, 21, 22, 23, 24]. Nevertheless, in robotics applications, it is not uncommon for objects to be detected via fiducial markers [25, 26, 27].
Robotic camera extrinsics. Closely related to the problem of estimating the camera-to-object pose (described above) is that of estimating the camera-to-robot pose. The classic solution to this problem is known as hand-eye calibration, in which a fiducial marker (such as ARTag [1], AprilTag [2], or another known object) is attached to the end effector and tracked through multiple frames. Using forward kinematics and multiple recorded frames, the algorithm solves a linear system to compute the camera-to-robot transform [28, 29, 30]. Similarly, an online calibration method is presented by Pauwels and Kragic [31], in which the 3D position of a known object is tracked through multiple frames, and nonlinear optimization is performed on the measurements.
Alternative approaches have also been explored, such as moving a small object on a table, moving the robot to point to each location in succession, and then using the forward kinematics to calibrate. The accuracy of such an approach degrades as the robot moves away from the table used for calibration. Aalerud et al. [33] present a method for calibrating an RGBD network of cameras with respect to each other for robotics applications, but the camera-to-robot transforms are not estimated.
For completeness, we note that, although our paper addresses the case of an externally mounted camera, another popular configuration is to mount the camera on the wrist , for which the classic hand-eye calibration approaches apply directly, with only slight adjustments required, see . Yet another configuration is to mount the camera on the ceiling pointing downward, for which simple 2D correspondences are sufficient [35, 36, 32].
Robotic pose estimation. Bohg et al. [37] explored the problem of markerless robot arm pose estimation. In this approach a random decision forest is applied to a depth image to segment the links of the robot, from which the robot joints are estimated. In followup work, Widmaier et al. [38] address the same problem but obviate the need for segmentation by directly regressing to the robot joint angles. Neither of these approaches estimates the camera-to-robot transform.
The most similar approach to ours is the recent work of Zuo et al. [39], who also present a keypoint-based detection network, using synthetic data to train the detector. Rather than use PnP, they leverage nonlinear optimization to directly regress to the camera pose as well as the unknown joint angles. To bridge the reality gap, their approach utilizes domain adaptation, which requires delicate manual annotations of real images. Their method is applied to a single low-cost manipulator with small reach.
In the problem we consider, there are three relevant coordinate frames, viz., that of the robot, of the camera, and of the image. An externally mounted camera observes keypoints on various robot links. These keypoints project onto the image as 2D points. Some of these projections may be inside the camera frustum, whereas others may be outside. We consider the latter to be invisible and inaccessible, whereas the former are visible, regardless of occlusion. (Since we deal primarily with self-occlusion, the network learns to estimate the positions of occluded keypoints from the surrounding context.) The intrinsics relating the camera and image frames are assumed known.
Our two-stage process for solving the problem of camera-to-robot pose estimation from a single RGB image frame is illustrated in Fig. 1. First, an encoder-decoder neural network is used to produce a set of belief maps, one per keypoint. Then, Perspective-n-Point (PnP) [7] uses the peaks of these 2D belief maps, along with the forward kinematics of the robot and the camera intrinsics, to compute the camera-to-robot pose.
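To make the second stage concrete, the following is a minimal sketch of pose recovery from 2D-3D correspondences. It uses a plain Direct Linear Transform (DLT) in NumPy rather than the EPnP solver cited above, so it illustrates the idea and is not the implementation used in the paper. The function name `solve_pnp_dlt` and its arguments are hypothetical. It assumes at least six non-coplanar keypoints whose 3D positions in the robot frame come from forward kinematics.

```python
import numpy as np


def solve_pnp_dlt(points_robot, keypoints_px, K):
    """Estimate the 4x4 transform taking robot-frame points to the camera frame.

    points_robot : (n, 3) keypoint positions in the robot frame.
    keypoints_px : (n, 2) detected keypoint locations in pixels.
    K            : (3, 3) camera intrinsics matrix.
    """
    X = np.asarray(points_robot, dtype=float)
    uv = np.asarray(keypoints_px, dtype=float)
    n = len(X)
    if n < 6:
        raise ValueError("this DLT sketch needs at least 6 correspondences")

    # Remove the intrinsics so that the unknown matrix is [R | t] up to scale.
    uv1 = np.hstack([uv, np.ones((n, 1))])
    xy = (np.linalg.inv(K) @ uv1.T).T
    Xh = np.hstack([X, np.ones((n, 1))])

    # Two linear equations per correspondence in the 12 entries of [R | t].
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Xh
    A[0::2, 8:12] = -xy[:, 0:1] * Xh
    A[1::2, 4:8] = Xh
    A[1::2, 8:12] = -xy[:, 1:2] * Xh

    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)
    if np.linalg.det(P[:, :3]) < 0:
        P = -P  # resolve the sign ambiguity so that the rotation is proper

    # Project the 3x3 block onto the nearest rotation and undo the scale.
    U, S, Vt_m = np.linalg.svd(P[:, :3])
    T = np.eye(4)
    T[:3, :3] = U @ Vt_m
    T[:3, 3] = P[:, 3] / S.mean()
    return T
```

Because the function only consumes correspondences, keypoints detected in several frames of a static camera can be stacked into one call, each paired with the 3D positions from that frame's joint configuration.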
III-A Network Architecture
Inspired by recent work on object pose estimation [12, 15, 14], we use an auto-encoder network to detect the keypoints. The neural network takes as input an RGB image and outputs a set of belief maps, one per keypoint, at full, half, or quarter of the input resolution depending on the decoder variant. The output for each keypoint is a 2D belief map, where pixel values represent the likelihood that the keypoint is projected onto that pixel.
The encoder consists of the convolutional layers of VGG-19 [40] pretrained on ImageNet. We also experimented with a ResNet-based encoder, viz., our reimplementation of a prior architecture. The decoder (upsampling) component is composed of four 2D transpose convolutional layers (stride = 2, padding = 1, output padding = 1), each followed by a normal convolutional layer and a ReLU activation layer. For the ResNet version, we follow the details of that prior work: batch normalization, upsampling layers, and so forth.
The output head is composed of 3 convolutional layers (stride = 1, padding = 1) with ReLU activations, with 64, 32, and n channels, respectively, where n is the number of keypoints. There is no activation layer after the final convolutional layer. The network is trained using an L2 loss function comparing the output belief maps with ground truth belief maps, where the latter are generated using pixels for generating the peaks.
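The training target and loss can be sketched as follows. This is an illustration under stated assumptions: the peak is taken to be an isotropic Gaussian, and the default `sigma=2.0` is a placeholder rather than a value confirmed by the text. The function names are hypothetical.

```python
import numpy as np


def make_belief_map(height, width, keypoint_xy, sigma=2.0):
    """Render one ground-truth belief map with a peak at keypoint_xy = (x, y).

    Assumption: the peak is a Gaussian with standard deviation `sigma` pixels.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    x0, y0 = keypoint_xy
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))


def l2_loss(predicted, target):
    """Mean squared difference between predicted and ground-truth belief maps."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((predicted - target) ** 2))
```

Stacking one such map per keypoint gives a target tensor with the same number of channels as the network output.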
III-B Pose Estimation
Given the 2D keypoint coordinates, robot joint configuration with forward kinematics, and camera intrinsics, PnP [7] is used to retrieve the pose of the robot, similar to [9, 17, 12, 15, 14]. The keypoint coordinates are calculated as a weighted average of values near thresholded peaks in the output belief maps (threshold = 0.03), after first applying Gaussian smoothing to the belief maps to reduce the effects of noise. The weighted average allows for subpixel precision.
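The keypoint extraction step can be sketched as below. The threshold of 0.03 comes from the text. The smoothing width, the window size, and the use of a single global peak per belief map are assumptions made for illustration. The function names are hypothetical.

```python
import numpy as np


def smooth(belief, sigma=1.0):
    """Separable Gaussian smoothing with zero padding at the borders."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    out = np.apply_along_axis(np.convolve, 1, belief, kernel, mode="same")
    out = np.apply_along_axis(np.convolve, 0, out, kernel, mode="same")
    return out


def extract_keypoint(belief, threshold=0.03, window=7, sigma=1.0):
    """Return the subpixel (x, y) location of the peak, or None if too weak.

    Assumption: each belief map contains at most one keypoint, so the global
    maximum of the smoothed map is taken as the peak.
    """
    b = smooth(np.asarray(belief, dtype=float), sigma)
    r, c = np.unravel_index(np.argmax(b), b.shape)
    if b[r, c] < threshold:
        return None
    # Weighted average of pixel coordinates in a window around the peak.
    r0, r1 = max(r - window, 0), min(r + window + 1, b.shape[0])
    c0, c1 = max(c - window, 0), min(c + window + 1, b.shape[1])
    patch = b[r0:r1, c0:c1]
    ys, xs = np.mgrid[r0:r1, c0:c1]
    total = patch.sum()
    return float((xs * patch).sum() / total), float((ys * patch).sum() / total)
```

A window that is wide relative to the peak keeps the weighted average from being pulled toward the integer peak location.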
III-C Data Generation
The network is trained only using synthetic data with domain randomization (DR). Despite the traditional challenges with bridging the reality gap, we find that our network generalizes well to real-world images, as we will show in the experimental results. To generate the data we used our open-source NDDS tool [8], which is a plugin for the UE4 game engine. We augmented NDDS to export 3D/2D keypoint locations and robot joint angles, as well as to allow control of the robot joints.
The synthetic robot was placed in a simple virtual 3D scene in UE4, viewed by a virtual camera. Various randomizations were applied: 1) The robot’s joint angles were set randomly, roughly according to the joint limits. 2) The camera was positioned freely in a somewhat truncated hemispherical shell around the robot, with azimuth ranging from ° to ° (excluding the back of the robot), elevation from ° to 75°, and distance from 75 cm to 120 cm; the optical axis was also randomized within a small cone. These values, which are given for the Panda, vary slightly for the other robots. 3) Three scene lights were positioned and oriented freely while randomizing both intensity and color. 4) The scene background was selected from the COCO dataset [41]. 5) 3D objects from the YCB dataset [42] were randomly placed, similar to flying distractors. 6) Random color tint was applied to the robot mesh.
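The camera placement in step 2 can be sketched as sampling a point in a spherical shell centered on the robot. The distance range (0.75 m to 1.20 m) and the upper elevation bound (75°) follow the text. The azimuth range and the lower elevation bound in the usage line are placeholder assumptions, since those values are not available here. The function name is hypothetical.

```python
import math
import random


def sample_camera_position(rng, azimuth_deg, elevation_deg, distance_m):
    """Sample a camera position (x, y, z) in a shell around the robot origin.

    Each argument after `rng` is a (min, max) range. Angles are drawn
    uniformly in azimuth and elevation, which is a simplification and is not
    uniform over the surface area of the shell.
    """
    az = math.radians(rng.uniform(*azimuth_deg))
    el = math.radians(rng.uniform(*elevation_deg))
    d = rng.uniform(*distance_m)
    x = d * math.cos(el) * math.cos(az)
    y = d * math.cos(el) * math.sin(az)
    z = d * math.sin(el)
    return (x, y, z)


# Usage: the azimuth range and lower elevation bound are assumed values.
rng = random.Random(0)
position = sample_camera_position(rng, (-135.0, 135.0), (-10.0, 75.0), (0.75, 1.20))
```

The optical-axis randomization described in the text would be applied afterwards by perturbing the look-at direction within a small cone.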
Figure 2 shows representative images from synthetic datasets generated by our approach. Domain randomized (DR) datasets were used for training and test; datasets without domain randomization (non-DR) were used for testing to assess sim-to-sim generalization.
IV Experimental Results
In this section we evaluate our DREAM method both in simulation and in the real world. We seek to answer the following questions: 1) How well does our training paradigm transfer to real-world data? 2) What is the impact of various network architectures? 3) What accuracy can be achieved with our system, and how does it compare to traditional hand-eye calibration?
IV-A Datasets & Metrics
We collected several real-world datasets in our lab using various RGBD sensors. Since our DREAM algorithm processes only RGB images, the depth images were captured solely to facilitate ground truth camera poses via DART [44], a depth-based articulated tracking algorithm. During data collection, DART was monitored in real-time via RViz to ensure tracking remained stable and correct, by viewing the projection of the robot coordinate frames onto the camera’s RGB images. DART was initialized manually to ensure proper convergence.
Panda-3Cam. The purpose of this dataset is to study the effect of robot pose. The camera was placed on a tripod aimed at a Franka Emika Panda robot arm. The robot was moved to five different joint configurations, at which the camera collected data for approximately five seconds each, followed by manual teleoperation to provide representative end effector trajectories along cardinal axes, as well as along a representative reaching trajectory. During data collection, neither the robot base nor the camera moved. The entire capture was repeated using three different cameras utilizing different modalities: Microsoft XBox 360 Kinect (structured light), Intel RealSense D415 (active stereo), and Microsoft Azure Kinect (time-of-flight). All images were recorded natively at 30 fps, except for those from the Azure Kinect, which were captured at a higher resolution and downsampled via bicubic interpolation. This dataset consists of 4907, 5945, and 6395 image frames, respectively, for the three cameras, for roughly 17k frames in total.
Panda-Orb. To evaluate the method’s ability to handle a variety of camera poses, we captured more data of the Franka Emika Panda. The RealSense camera was again placed on a tripod, but this time at 27 different positions (namely, 9 different azimuths, ranging from roughly -180° to +180°, and for each azimuth two different elevations of approximately 30° and 45°, along with a slightly closer depth at 30°). For each camera pose, the robot was commanded using RMPs [45, 46] to perform the same joint motion sequence of navigating between 10 waypoints defined in both task and joint space. The dataset consists of approximately 40k image frames.
We compute metrics in both 2D and 3D. For the 2D evaluation, we calculate the percentage of correct keypoints (PCK) below a certain threshold, as the threshold is varied. All keypoints whose ground truth is within the camera’s frustum are considered, regardless of whether they are occluded. For 3D evaluation, we calculate the average distance (ADD) [10, 13], which is the average Euclidean distance of all 3D keypoints to their transformed versions, when the estimated camera pose is used for the transform. As with PCK, the percentage of keypoints with ADD lower than a threshold is calculated, as the threshold is varied. For ADD, all keypoints are considered. In both cases, averages are computed over keypoints over all image frames.
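The two metrics can be sketched for a single frame as follows. The function names, argument layout, and the use of 4x4 homogeneous transforms are choices made for illustration, not taken from the paper's code.

```python
import numpy as np


def pck(pred_px, gt_px, threshold_px):
    """Fraction of keypoints whose 2D error is below threshold_px pixels."""
    diff = np.asarray(pred_px, dtype=float) - np.asarray(gt_px, dtype=float)
    err = np.linalg.norm(diff, axis=-1)
    return float(np.mean(err < threshold_px))


def add_metric(points_robot, T_est, T_gt):
    """Average 3D distance between keypoints under estimated and true poses.

    points_robot : (n, 3) keypoints in the robot frame.
    T_est, T_gt  : (4, 4) estimated and ground-truth transforms.
    """
    P = np.asarray(points_robot, dtype=float)
    Ph = np.hstack([P, np.ones((len(P), 1))])
    est = (T_est @ Ph.T).T[:, :3]
    gt = (T_gt @ Ph.T).T[:, :3]
    return float(np.mean(np.linalg.norm(est - gt, axis=1)))
```

Sweeping `threshold_px` (or a threshold on the ADD value) over a range and plotting the resulting fractions gives the threshold curves described above.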
IV-B Training and Simulation Experiments
For comparison, we trained three versions of our DREAM network. The architectures use either a VGG- or ResNet-based encoder, where the latter is our reimplementation of a prior architecture, and the decoder outputs either full (F), half (H), or quarter (Q) resolution. Each neural network was trained for 50 epochs using Adam [48] with 1.5e-4 learning rate and 0.9 momentum. Training used approximately 100k synthetic DR images. The best-performing weights were selected according to a synthetic validation set.
As a baseline, Fig. 3 compares these versions on two simulated datasets, one with domain randomization (DR) and the other without (non-DR). The improvement due to increasing resolution is clear; different architectures, on the other hand, have less impact for most scenarios.
IV-C Real-world Experiments
Results comparing the same three versions on the Panda-3Cam dataset are shown in Fig. 4. Encouragingly, these results show that our training procedure is able to bridge the reality gap: There is only a modest difference between the best performing network on simulated and real data.
For 3D, it is worth noting that the standard deviation of ground truth camera pose from DART was 1.6 mm, 2.2 mm, and 2.9 mm for the XBox 360 Kinect (XK), RealSense (RS), and Azure Kinect (AK) cameras, respectively. Note that the degraded results for the Azure Kinect are due to DART’s struggles in dealing with the time-of-flight-based depth, rather than to DREAM itself. On the other hand, the degraded results for XBox 360 Kinect are due to the poor RGB image quality from that sensor.
Ultimately, the goal is to be able to place the camera at an arbitrary pose aimed at a robot, and calibrate automatically. To measure DREAM’s ability to achieve this goal, we evaluated the system on the Panda-Orb dataset containing multiple camera poses. These results, alongside those of the previous experiments, are shown in Tab. I. For this table we used DREAM-vgg-F, since the other architectures behave similarly as before.
[Table I: PCK at pixel thresholds and ADD at mm thresholds for DREAM-vgg-F on the Panda-3Cam and Panda-Orb datasets.]
We also trained DREAM on the Kuka LBR iiwa and Baxter robots. The former is similar to the Panda. The latter, however, is more difficult: due to symmetry, the two arms look similar to one another and are easily confused, and we therefore had to restrict the azimuth range to between -45° and +45°. We did not perform quantitative analysis on either robot due to various hardware issues, but we found qualitatively that the approach works about the same for these robots as it does for the Panda. The detected keypoints overlaid on images of the three robots, using three different RGB cameras, are shown in Fig. 5. Although keypoints could in principle be anywhere on the robot, we simply assign keypoints to the joints according to the robot URDF.
IV-D Comparison with Hand-Eye Calibration
The goal of our next experiment was to assess the accuracy of DREAM versus traditional hand-eye calibration (HEC). For the latter, we used the easy_handeye ROS package (https://github.com/IFL-CAMP/easy_handeye) while tracking an ArUco fiducial marker [49] attached to the Panda robot hand.
The XBox 360 Kinect was mounted on a tripod, and the robot arm was moved to a sequence of different joint configurations, stopping at each configuration for one second to collect data. Neither the camera nor the base of the robot moved. The fiducial marker was then removed from the hand, and the robot arm was moved to a different sequence of joint configurations. The joint configurations were selected favorably to ensure that the fiducial markers and keypoints, respectively, were detected in the two sets of images. As before, DART with manual initialization was used for ground truth.
Although our DREAM approach works with just a single RGB image, it can potentially achieve greater accuracy with multiple images by simply feeding all detected keypoints (from multiple frames) to a single PnP call. Thus, to facilitate a direct comparison with HEC, we applied DREAM to m images drawn from the set of M images that were collected. Similarly, we applied HEC to m images from its set. Both algorithms were then evaluated by comparing the estimated pose with the ground truth pose via ADD. For both algorithms, we evaluated the possible combinations of m images out of M, which allows the mean, median, min, and max to be computed. To avoid unnecessary combinatorial explosion, whenever the number of combinations exceeded a fixed limit, we randomly selected that many combinations rather than enumerating them exhaustively.
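The subset selection used in this comparison can be sketched with the standard library. The function name `frame_subsets` and the parameter `max_subsets` are hypothetical, and the cap used in the paper is not stated here. Drawing distinct random subsets is also a choice made for this sketch.

```python
import itertools
import math
import random


def frame_subsets(num_frames, subset_size, max_subsets, seed=0):
    """Return index tuples of frame subsets to evaluate.

    All combinations are enumerated when there are at most max_subsets of
    them; otherwise max_subsets distinct combinations are drawn at random.
    """
    total = math.comb(num_frames, subset_size)
    if total <= max_subsets:
        return list(itertools.combinations(range(num_frames), subset_size))
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_subsets:
        chosen.add(tuple(sorted(rng.sample(range(num_frames), subset_size))))
    return sorted(chosen)
```

Each returned tuple selects the frames whose keypoints (for DREAM) or marker detections (for HEC) are passed to one calibration run. The per-run ADD values are then summarized by their mean, median, min, and max.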
Results of this head-to-head comparison are shown in Fig. 6. Note that HEC is unable to estimate the camera pose when too few images are available, whereas DREAM works with just a single image. As the number of images increases, the estimated pose from both DREAM and HEC improves, depending somewhat on the robot configurations used. In all cases, DREAM performs as well as or better than the standard approach of HEC.
IV-E Measuring Workspace Accuracy
In this experiment we evaluated the accuracy of DREAM’s output with respect to the workspace of the robot. The RealSense camera was placed on a tripod facing the Panda robot reaching toward an adjustable-height table on which were placed five fiducial markers. A head-to-head comparison of the camera poses computed by DART, DREAM, and HEC was conducted by commanding the robot to move the end effector to each of ten locations (5 markers, 2 table heights). The Euclidean distance between the end effector’s position in 3D and the desired position (which was 10 cm directly above each marker, to avoid potential collision) was then measured for each algorithm. Note that in this experiment DART was not considered to be ground truth, but rather was compared against the other methods.
Results are shown in Tab. II. Even though DREAM is RGB-only, it performs favorably not only to HEC but also to the depth-based DART algorithm. This is partly explained by the fact that the extrinsic calibration between the depth and RGB cameras is not perfect. Note that DREAM’s error is similar to that of our previous work [9], where we showed that an error of approximately 2 cm for object pose estimation from RGB is possible, and is sufficient for reliable grasping of household objects.
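Table II: Euclidean error between the end effector's reached and desired positions, one column per method.
| min error (mm) | 10.1 | 9.4 | 20.2 |
| max error (mm) | 44.3 | 51.3 | 34.7 |
| mean error (mm) | 21.4 | 28.2 | 27.4 |
| std error (mm) | 12.3 | 14.2 | 4.7 |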
We have presented a deep neural network-based approach to compute the extrinsic camera-to-robot pose using a single RGB image. To our knowledge, ours is the first system with such capability. Compared to traditional hand-eye calibration, we showed that our DREAM method achieves better accuracy even though it does not use fiducial markers or multiple frames. Nevertheless, with additional frames, our method is able to reduce the error even further with only a trivial modification. We have presented quantitative results on a robot manipulator using images from three different cameras, and we have shown qualitative results on other robots using other cameras. We believe the proposed method takes a significant step toward robust, online calibration. Future work should be aimed at filtering results over time, computing uncertainty, and incorporating the camera pose into a closed-loop grasping or manipulation task.
We gratefully acknowledge Kevin Zhang and Mohit Sharma (Carnegie Mellon University) for their help. Many thanks also to Karl van Wyk, Clemens Eppner, Chris Paxton, Ankur Handa, and Erik Leitch for their help.
-  M. Fiala, “ARTag, a fiducial marker system using digital techniques,” in CVPR, 2005.
-  E. Olson, “AprilTag: A robust and flexible visual fiducial system,” in ICRA, 2011.
-  R. Y. Tsai and R. K. Lenz, “A new technique for fully autonomous and efficient 3d robotics hand/eye calibration,” IEEE Transactions on robotics and automation, vol. 5, no. 3, pp. 345–358, 1989.
-  I. Fassi and G. Legnani, “Hand to sensor calibration: A geometrical interpretation of the matrix equation AX = XB,” Journal of Robotic Systems, vol. 22, no. 9, pp. 497–506, 2005.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in IROS, 2017.
-  S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
-  V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, no. 2, 2009.
-  T. To, J. Tremblay, D. McKay, Y. Yamaguchi, K. Leung, A. Balanon, J. Cheng, and S. Birchfield, “NDDS: NVIDIA deep learning dataset synthesizer,” 2018, https://github.com/NVIDIA/Dataset_Synthesizer.
-  J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, “Deep object pose estimation for semantic robotic grasping of household objects,” in CoRL, 2018.
-  S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes,” in ACCV, 2012.
-  T. Hodaň, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis, “T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects,” in WACV, 2017.
-  S. Zakharov, I. Shugurov, and S. Ilic, “DPOD: Dense 6D pose object detector in RGB images,” arXiv preprint arXiv:1902.11020, 2019.
-  Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes,” in RSS, 2018.
-  Y. Hu, J. Hugonot, P. Fua, and M. Salzmann, “Segmentation-driven 6D object pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3385–3394.
-  S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel-wise voting network for 6dof pose estimation,” in CVPR, 2019.
-  M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel, “Implicit 3d orientation learning for 6d object detection from rgb images,” in ECCV, 2018.
-  B. Tekin, S. N. Sinha, and P. Fua, “Real-time seamless single shot 6D object pose prediction,” in CVPR, 2018.
-  D. J. Tan, N. Navab, and F. Tombari, “6D object pose estimation with depth images: A seamless approach for robotic interaction and augmented reality,” in arXiv 1709.01459, 2017.
-  A. Dhall, D. Dai, and L. V. Gool, “Real-time 3D traffic cone detection for autonomous driving,” in arXiv:1902.02394, 2019.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in CVPR, 2016.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in CVPR, 2017.
-  B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 466–481.
-  W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei, and J. Sun, “Rethinking on multi-stage networks for human pose estimation,” arXiv preprint arXiv:1901.00148, 2019.
-  K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in CVPR, 2019.
-  C. Liu and M. Tomizuka, “Robot safe interaction system for intelligent industrial co-robots,” in arXiv:1808.03983, 2018.
-  C. H. Kim and J. Seo, “Shallow-depth insertion: Peg in shallow hole through robotic in-hand manipulation,” in ICRA, 2019.
-  N. Tian, A. K. Tanwani, J. Chen, M. Ma, R. Zhang, B. H. K. Goldberg, and S. Sojoudi, “A fog robotic system for dynamic visual servoing,” in ICRA, 2019.
-  F. C. Park and B. J. Martin, “Robot sensor calibration: solving AX= XB on the Euclidean group,” IEEE Transactions on Robotics and Automation, vol. 10, no. 5, pp. 717–721, 1994.
-  J. Ilonen and V. Kyrki, “Robust robot-camera calibration,” in Proceedings, International Conference on Advanced Robotics, 2011.
-  D. Yang and J. Illingworth, “Calibrating a robot camera,” in BMVC, 1994.
-  K. Pauwels and D. Kragic, “Integrated on-line robot-camera calibration and object pose estimation,” in Proceedings, IEEE International Conference on Robotics and Automation, 2016.
-  D. Park, Y. Seo, and S. Y. Chun, “Real-time, highly accurate robotic grasp detection using fully convolutional neural networks with high-resolution images,” in arXiv:1809.05828, 2018.
-  A. Aalerud, J. Dybedal, and G. Hovland, “Automatic calibration of an industrial RGB-D camera network using retroreflective fiducial markers,” Sensors, vol. 19, 03 2019.
-  D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in RSS, 2018.
-  A. Feniello, H. Dang, and S. Birchfield, “Program synthesis by examples for object repositioning tasks,” in IROS, 2014.
-  J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in RSS, 2017.
-  J. Bohg, J. Romero, A. Herzog, and S. Schaal, “Robot arm pose estimation through pixel-wise part classification,” in ICRA, 2014, pp. 3143–3150.
-  F. Widmaier, D. Kappler, S. Schaal, and J. Bohg, “Robot arm pose estimation by pixel-wise regression of joint angles,” in ICRA, 2016, pp. 616–623.
-  Y. Zuo, W. Qiu, L. Xie, F. Zhong, Y. Wang, and A. L. Yuille, “Craves: Controlling robotic arm with a vision-based economic system,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4214–4223.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
-  T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in CVPR, 2014.
-  B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, “The YCB object and model set: Towards common benchmarks for manipulation research,” in Intl. Conf. on Advanced Robotics (ICAR), 2015.
-  J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep networks with synthetic data: Bridging the reality gap by domain randomization,” in CVPR Workshop on Autonomous Driving (WAD), 2018.
-  T. Schmidt, R. A. Newcombe, and D. Fox, “Dart: Dense articulated real-time tracking.” in Robotics: Science and Systems, 2014.
-  N. D. Ratliff, J. Issac, D. Kappler, S. Birchfield, and D. Fox, “Riemannian motion policies,” in arXiv:1801.02854, 2018.
-  C.-A. Cheng, M. Mukadam, J. Issac, S. Birchfield, D. Fox, B. Boots, and N. Ratliff, “Rmpflow: A computational graph for automatic motion policy generation,” in WAFR, 2018.
-  J. Tremblay, T. To, A. Molchanov, S. Tyree, J. Kautz, and S. Birchfield, “Synthetically trained neural networks for learning human-readable plans from real-world demonstrations,” in ICRA, 2018.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
-  S. Garrido-Jurado, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and M. J. Marín-Jiménez, “Automatic generation and detection of highly reliable fiducial markers under occlusion,” Pattern Recognition, vol. 47, no. 6, pp. 2280–2292, 2014.