Monocular 3D Object Detection via Geometric Reasoning on Keypoints
Abstract
Monocular 3D object detection is well known to be a challenging vision task due to the loss of depth information; attempts to recover depth using separate image-only approaches lead to unstable and noisy depth estimates, harming 3D detections. In this paper, we propose a novel keypoint-based approach for 3D object detection and localization from a single RGB image. We build our multi-branch model around 2D keypoint detection in images and complement it with a conceptually simple geometric reasoning method. Our network performs in an end-to-end manner, simultaneously and interdependently estimating 2D characteristics, such as 2D bounding boxes, keypoints, and orientation, along with the full 3D pose in the scene. We fuse the outputs of the distinct branches, applying a reprojection consistency loss during training. The experimental evaluation on the challenging KITTI dataset benchmark demonstrates that our network achieves state-of-the-art results among monocular 3D detectors.
1 Introduction
The success of autonomous robotics systems, such as self-driving cars, largely relies on their ability to operate in complex dynamic environments; as an essential requirement, autonomous systems must reliably identify and localize non-stationary and interacting objects, e.g., vehicles, obstacles, or humans. In its simplest formulation, localization is understood as the ability to detect and frame objects of interest in 3D bounding boxes, providing their 3D locations in the surrounding space. Crucial to the decision-making process is the accuracy of the depth estimates of the 3D detections.
Depth estimation can be approached from both hardware and algorithmic perspectives. On the sensor end, laser scanners such as LiDAR devices have been extensively used to acquire depth measurements sufficient for 3D detection in many cases [49, 3, 16, 51, 15, 48]. However, point clouds produced by these expensive sensors are sparse and noisy, and they massively increase memory footprints, with millions of 3D points acquired per second. In contrast, image-based 3D detection methods offer savings on CPU and memory consumption, use cheap on-board cameras, and work with a wealth of established detection architectures (e.g., [24, 35, 36, 19, 11, 20, 23]); yet they require sophisticated algorithms for depth estimation, as raw depth can no longer be accessed.
Recent research on monocular 3D object detection relies on separate dense depth estimation models [33, 46], but depth recovery from monocular images is naturally ill-posed, leading to unstable and noisy estimates. In addition, in many practical instances, e.g., with sufficient target resolution or visibility, dense depth estimation might be redundant in the context of 3D detection. Instead, one may focus on obtaining sparse but salient features, such as 2D keypoints, which are well-known visual cues often serving as geometric constraints in various vision tasks such as human pose estimation [34, 26, 27, 28] and more general object interpretation [13, 43].
Motivated by this observation, in this paper we propose a novel keypoint-based approach for 3D object detection and localization from a single RGB image. We build our model around 2D keypoint detection in images and complement it with a conceptually simple geometric reasoning framework, establishing correspondences between the detected 2D keypoints and their 3D counterparts defined on the surfaces of 3D CAD models. The framework operates under general assumptions: given the camera intrinsic parameters, it retrieves the instance depth from the closest pair of keypoints, thereby "lifting" 2D keypoints to 3D space; the remaining 3D keypoints and the final 3D detection are assembled in a similar way. Our approach does not require manually keypoint-annotated images; instead, it relies on a multi-task reprojection consistency loss, allowing for robust 2D keypoint detection. Thus, our model is end-to-end trainable.
In summary, our contributions are as follows:

We propose a novel deep learning-based framework for monocular 3D object detection, combining well-established region-based detectors with a geometric reasoning step over keypoints.

We describe an end-to-end training scheme for this framework, using a dataset of real-world images and a collection of 3D CAD models annotated with 3D keypoints.
The rest of this paper is organized as follows. In Section 2, we review the related work on object detection, mostly in the context of selfdriving and robotics applications. In Section 3, we describe our proposed monocular 3D object detection approach, and in Section 4, its experimental evaluation using the standard KITTI benchmark. We conclude in Section 5 with a discussion of our results.
2 Related work
2D object detection.
2D object detection is an extensively studied vision task, with a body of research devoted to both algorithms [35, 24, 20, 36, 11, 23] and benchmarks [5, 6, 21, 8, 7]. Traditionally, object detectors operate in two stages, with the first stage selecting object candidates [41, 54, 36] and the second stage operating as a discriminator and refinement model, rejecting bad proposals [36, 11, 4]. Due to the introduction of novel backbone architectures [45] and losses [20], such approaches have attained top results in a number of benchmarks. In the context of robotics, single-stage detectors such as YOLO [35], SSD [24] and RetinaNet [20] are of particular interest; however, they offer inferior performance and are not straightforward to extend to related tasks such as instance segmentation [11, 23] and keypoint detection [11].
3D object detection.
Recently, novel deep learning architectures operating directly on unstructured point clouds have been proposed [31, 32, 42, 10, 14, 17], offering the possibility to develop corresponding 3D object detectors [30, 51, 47]. However, such approaches require expensive sensing equipment (LiDARs) and commonly process point cloud data coupled with RGB data. Some depth-based approaches operate over voxel-grid representations of the point clouds, leveraging existing convolutional architectures [49, 16, 51, 48], while other methods fuse depth features with bird's-eye-view (BEV) and image features [3, 15, 30].
Monocular 3D object detection.
The most relevant to our work is research on monocular 3D object detection, which is well known to be a challenging vision task. Deep3DBox [29] relies on a set of geometric constraints between 2D and predicted 3D bounding boxes and reduces the 3D object localization problem to a linear system of equations, fitting 3D box projections into 2D detections. Their approach relies on a separate linear solver; in contrast, our model is end-to-end trainable and does not require external optimization. Mono3D [2] extensively samples 14K 3D bounding box proposals per image and evaluates each, exploiting semantic and image-based features. In contrast, our approach does not rely on exhaustive sampling in 3D space, bypassing a significant computational overhead. OFTNet [37] introduces an orthographic feature transform which maps RGB image features into a bird's-eye-view representation through a 3D scene grid, solving the perspective projection problem. However, back-projecting image features onto a 3D grid results in a coarse feature assignment. Our approach detects 2D keypoints with sufficient precision, avoiding any additional discretization. MonoGRNet [33] directly deals with depth estimation from a single image, training an additional subnetwork to predict the coordinates of each 3D bounding box. MF3D [46] exploits a similar approach, estimating disparity using a standalone pretrained MonoDepth network [9]. Both methods rely on non-trainable depth estimation networks, which introduce a computational overhead; in contrast, our approach jointly estimates the object's 2D bounding box and 3D pose in a fully trainable manner, not requiring a dense depth prediction.
Perhaps the most similar approach to ours is [1], which utilizes 3D CAD models along with predicting 2D keypoints. However, their network only models 2D geometric properties and aims at matching the predictions to one of the CAD shapes, while 3D pose estimation is deferred to the inference step. They additionally exploit extensive keypoint annotations of their 3D models. In contrast, we annotate only 14 keypoints on each of the five 3D models and exploit them in a geometric reasoning module to bridge the gap between the 2D and 3D worlds, which allows us to deal with 3D characteristics during training in an end-to-end manner.
Keypoints estimation and 3D representations.
Keypoint-based representations are a common mechanism for encoding the 3D geometric structure of objects, and have proven themselves as powerful visual cues for tasks such as pose estimation [34, 26, 27, 28], fine pose prediction [18], 3D reconstruction [38], and shape alignment [18, 25], to name a few. A commonly used approach is to learn a set of keypoint detectors, followed by some post-processing to assemble their predictions into a geometric model. However, obtaining sufficient ground truth for training keypoint detectors is a challenging task. One may manually annotate 3D keypoints of objects in real images, but this is labor-intensive and often inaccurate. Other directions involve active shape modeling [34, 22] and shape alignment with wireframe [52, 53, 50] and 3D CAD models [1]. For human pose estimation, another option is motion capture of joint locations [26]. Recently, latent modeling approaches have been proposed to learn optimal sets of keypoints without direct supervision [43, 39]. Our keypoint detection approach bears similarity to [1], as we utilize 3D CAD models and align them to sensor measurements, but offers labor savings, since we only annotate 14 keypoints on each of the five CAD models.
3 3D Object Detection Framework
Given a single RGB image, our goal is to localize target objects in the 3D scene. To do this, we propose an end-to-end trainable CNN-based framework that accepts a single RGB image as input and outputs a set of 3D detections. Each target object is defined by its class and 3D bounding box, parameterized by the 3D center coordinates $C$ in the camera coordinate system, the global orientation $R$, and the dimensions $D = (w, h, l)$, standing for width, height and length, respectively (we do not correct for truncation or occlusion when defining object sizes). We parameterize the global object orientation by the yaw angle only, which is a commonly adopted premise when dealing with objects in road scenes [29, 30].
Our proposed framework comprises two submodules, each of which operates on the characteristics living in either 2D (image) or 3D (world) space. From the 2D perspective, each object of interest, cropped by its predicted 2D bounding box, is provided with 2D keypoints and their respective visibility states. On the 3D side, object dimensions, 3D CAD model, and local orientation are predicted. The gap between the two spaces is bridged by the geometric reasoning module computing instance depth, global orientation, and the final 3D detection.
Our implementation takes advantage of the generality of the state-of-the-art Mask R-CNN architecture [11], viewing it as a universal backbone network extensible to adjacent problems, and complements it with three subnetworks: a 2D object detection subnetwork, a 2D keypoint regression subnetwork, and a dimension regression subnetwork. The whole system represents an end-to-end trainable network, depicted in Figure 1, with subnetworks initially trained independently, switching then to joint training via the introduced multi-task reprojection consistency loss on the projected 3D keypoints and 3D bounding box corners.
2D object detection.
For 2D detection, we follow the original Mask R-CNN architecture [11], which includes a Feature Pyramid Network (FPN) [19], a Region Proposal Network (RPN) [36] and a RoIAlign module [11]. The RPN generates 2D anchor boxes with a set of fixed aspect ratios and scales throughout the area of the provided feature maps, which are scored for the presence of an object of interest and adjusted. The proposed regions are processed by the RoIAlign block, which converts each feature map, framed by the region of interest, into a fixed-size grid, preserving accurate spatial locations through bilinear interpolation. After fully connected layers, the network splits into two feature-sharing branches for bounding box regression and object classification. During training, we utilize the smooth L1 and cross-entropy losses for each task, respectively, as proposed in [36]. Though we do not directly utilize the predicted 2D bounding boxes, we have experimentally observed the 2D detection subnetwork to stabilize training.
2D keypoint detection.
We predict 2D coordinates and a visibility state for each of the 14 manually chosen keypoints (cf. Figure 3 for details on our choice of 3D keypoints). Unlike the parameterization suggested in [11, 40], we directly regress the 2D coordinates of the keypoints. The visibility state, determined by the occlusion and truncation of an instance, is a binary variable; no difference between occluded, self-occluded and truncated states is made. Adding visibility estimation helps propagate information during training for visible keypoints only and acts as auxiliary supervision for the orientation subnetwork. During training, similar to our 2D object detection subnetwork, we minimize the multi-task loss combining a smooth L1 loss for coordinate regression and a cross-entropy loss for visibility state classification, defined as:
(1)  $L_{kp} = \sum_{k=1}^{14} \left[ \mathbb{1}[v_k] \, L_{L1}\!\left(p_k, \hat{p}_k\right) + L_{CE}\!\left(v_k, \hat{v}_k\right) \right]$
where $\mathbb{1}[v_k]$ is the visibility indicator of the $k$-th keypoint, while $p_k$ and $\hat{p}_k$ denote the ground-truth and predicted 2D coordinates, normalized and defined in the reference frame of a specific feature map after RoI alignment. Similarly, $v_k$ is the ground-truth visibility status, while $\hat{v}_k$ is the estimated probability that keypoint $k$ is visible.
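As an illustration, the loss of Eq. (1) can be sketched in NumPy as follows (a minimal reference computation under our notation, not the actual network code; function names are ours):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber) penalty applied element-wise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def keypoint_loss(kp_gt, kp_pred, vis_gt, vis_prob, eps=1e-7):
    """Eq. (1): visibility-masked smooth L1 on normalized 2D coordinates
    plus cross-entropy on the per-keypoint visibility state.
    Shapes: kp_* are (K, 2); vis_gt is (K,) in {0, 1}; vis_prob is (K,)."""
    # Regression term is masked so only visible keypoints contribute.
    reg = (vis_gt[:, None] * smooth_l1(kp_gt - kp_pred)).sum()
    # Binary cross-entropy on the visibility probabilities.
    ce = -(vis_gt * np.log(vis_prob + eps)
           + (1.0 - vis_gt) * np.log(1.0 - vis_prob + eps)).sum()
    return reg + ce
```

Note how an arbitrarily large coordinate error on an invisible keypoint leaves the loss unchanged, which is the "propagate information for visible keypoints only" behavior described above.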
3D dimension estimation and geometric classification.
To each annotated 3D instance in the dataset, we have assigned a 3D CAD model out of a predefined set of 5 templates, obtaining 5 distinct geometric classes of instances. The templates used are presented in Figure 2. The assignment has been made based on the width, length and height ratios only. For each geometric class, we have computed the mean dimensions over all assigned annotated 3D instances.
When training the 3D dimension estimation and geometric class selection subnetwork, we utilize a multi-task loss combining a cross-entropy loss (for the geometric class selection) and a smooth L1 loss for dimension regression. Instead of regressing the absolute dimensions, we predict the differences from the mean dimensions in log-space:
(2)  $L_{dim} = \sum_{i \in \{w, h, l\}} L_{L1}\!\left(\delta_i, \hat{\delta}_i\right), \qquad \delta_i = \log\left(D_i / \bar{D}_i\right)$
where $\delta_i$ and $\hat{\delta}_i$ represent the ground-truth and predicted offsets to the class mean values $\bar{D}_i$ along each dimension, respectively.
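The log-space encoding above and its inverse can be sketched as follows (an illustrative helper pair, assuming per-class mean dimensions as described; not the authors' code):

```python
import numpy as np

def encode_dims(dims, class_means):
    """Regression target of Eq. (2): per-axis log-space offset of the
    absolute (w, h, l) dimensions from the assigned class means."""
    return np.log(dims / class_means)

def decode_dims(offsets, class_means):
    """Invert the encoding to recover absolute dimensions in meters."""
    return class_means * np.exp(offsets)
```

An instance matching its class means exactly yields a zero target, so the network only learns small residuals around each template.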
Reasoning about instance depth.
We define the instance depth as the depth of a vertical plane passing through the two closest visible keypoints, defined in the camera reference frame. To compute this depth value, we use the predicted 2D keypoints, the instance height (in meters), and its geometric class. First, we select two keypoints $p_i$ and $p_j$ in the image and compute their vertical pixel difference $\Delta v = |v_i - v_j|$. We then select the corresponding two keypoints $P_i$ and $P_j$ in the CAD model reference frame and compute their height ratio $r = |Y_i - Y_j| / h_{CAD}$, where $h_{CAD}$ is the template height. Finally, the distance to the object is obtained from the pinhole camera model:
(3)  $z = f \cdot r \cdot h \, / \, \Delta v$
where $f$ is the focal length of the camera, known for each frame, and $h$ is the predicted instance height. Figure 3 illustrates this computation. The depth coordinate $z$ allows us to retrieve the remaining 3D location coordinates of one of the selected keypoints, using the back-projection mapping:
(4)  $x = (u - c_u) \cdot z / f, \qquad y = (v - c_v) \cdot z / f$
where $(c_u, c_v)$ are the camera principal point coordinates in pixels.
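Equations (3) and (4) amount to a few lines under the pinhole model; a minimal sketch (our own notation, assuming 2D points as (u, v) pixels and CAD keypoints with the vertical axis as the second coordinate):

```python
import numpy as np

def instance_depth(p_i, p_j, P_i, P_j, h_cad, height_m, focal):
    """Eq. (3): depth of the plane through two vertically separated
    keypoints. p_i, p_j: predicted 2D keypoints in pixels; P_i, P_j:
    corresponding CAD-frame 3D keypoints; h_cad: template height;
    height_m: predicted instance height in meters."""
    dv = abs(p_i[1] - p_j[1])            # vertical pixel difference
    r = abs(P_i[1] - P_j[1]) / h_cad     # height ratio on the template
    return focal * r * height_m / dv

def backproject(p, z, focal, c_u, c_v):
    """Eq. (4): lift a 2D keypoint to camera-frame 3D coordinates,
    given its depth z and the principal point (c_u, c_v)."""
    x = (p[0] - c_u) * z / focal
    y = (p[1] - c_v) * z / focal
    return np.array([x, y, z])
```

For example, a keypoint pair spanning half the template height on a 1.5 m tall car, 52.5 px apart at focal length 700 px, lands at 10 m depth.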
Orientation estimation.
Direct estimation of the orientation R in the camera reference frame is not feasible, as the region proposal network propagates only the context within the crops, cutting off the relation of the crop to the image plane. Inspired by [29], we represent the global orientation as a combination of two rotations with azimuths defined as:
(5)  $\theta = \theta_l + \theta_{ray}$
where $\theta_l$ is the object's local orientation within the region of interest, and $\theta_{ray}$ is the direction of the ray from the camera to the object center, directly found from the 3D location coordinates. We estimate $\theta_l$ using a modification of the MultiBin approach [29]. Specifically, instead of splitting the objective into angle confidence and localization parts, we discretize the angle range from $-180$ to $180$ degrees into 72 non-overlapping bins and compute the probability distribution over this set of angles with a softmax layer. We train the local orientation subnetwork using cross-entropy as the loss function. To obtain the final prediction for $\theta_l$, we utilize the weighted mean of the bin medians, adopting the softmax output as the weights. Given the 3D location coordinates $(x, y, z)$ of one of the keypoints and the weighted mean local orientation $\theta_l$, the global orientation is defined as follows:
(6)  $\theta = \theta_l + \arctan\left(x / z\right)$
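The decoding step described above can be sketched as follows (an illustrative NumPy version assuming 72 bins tiling [-pi, pi); the plain weighted mean mirrors the text and, like it, ignores angle wrap-around):

```python
import numpy as np

def local_orientation(bin_probs, n_bins=72):
    """Weighted mean of the bin medians, with the softmax output of the
    orientation head as the weights; bins tile [-pi, pi) in 5-degree steps."""
    medians = -np.pi + (np.arange(n_bins) + 0.5) * (2.0 * np.pi / n_bins)
    return float((bin_probs * medians).sum())

def global_yaw(theta_local, x, z):
    """Eq. (6): global yaw = local orientation + ray angle toward the
    lifted keypoint at camera-frame coordinates (x, z)."""
    return theta_local + np.arctan2(x, z)
```

A one-hot softmax output recovers exactly the median of the active bin; softer distributions interpolate between neighboring bins.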
3D object detection.
To obtain the center $C$ of the final 3D bounding box, we use the global orientation and the offset between the selected keypoint and the object center. For a particular CAD model, given the width, height and length ratios $\rho$ between the keypoint-to-center offset and the object dimensions, the estimated object dimensions $D$, and the global orientation $R$, the location $C$ is predicted as
(7)  $C = X_k + R \left(\sigma \odot \rho \odot D\right)$
where $\odot$ stands for an element-wise product and $X_k$ is the 3D location of the selected keypoint. Depending on the selected keypoint position (left or right, back or front, top or bottom side of the object), a sign $\sigma_i \in \{-1, +1\}$ is chosen for each dimension.
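Composing the center from one lifted keypoint, per Eq. (7), can be sketched as follows (illustrative only; we assume a yaw-only rotation about the vertical axis, consistent with the orientation parameterization above):

```python
import numpy as np

def yaw_matrix(theta):
    """Rotation about the vertical (y) axis by yaw angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def box_center(kp_3d, ratios, signs, dims, theta):
    """Eq. (7): 3D box center from one lifted keypoint. `ratios` is the
    per-axis keypoint-to-center offset ratio taken from the CAD template,
    `signs` in {-1, +1} per axis depending on which side of the object
    the keypoint lies on, `dims` = (w, h, l), theta the global yaw."""
    offset = signs * ratios * dims          # element-wise product
    return kp_3d + yaw_matrix(theta).dot(offset)
```

With zero yaw the center is simply the keypoint shifted by the signed template offset; a nonzero yaw rotates that offset into the camera frame first.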
Multi-head reprojection consistency loss.
Except for the shared convolutional backbone, each subnetwork is independent of its neighbors and unaware of the other predictions, though the geometric components are strongly interrelated. To provide consistency between the network branches, we introduce a loss function which integrates all the predictions. The 3D coordinates (in the CAD model coordinate system) of the keypoints from the set K are scaled using D, rotated using R, translated using C, and projected onto the image plane via the camera projection matrix to obtain 2D keypoint coordinates, which are compared with the ground-truth values. A similar approach is applied to the eight corners of the 3D bounding box obtained from the 3D detection and orientation estimates, to ensure that they fit tightly into the ground-truth 2D bounding box after projection. In all cases, we use the smooth L1 loss during training.
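The keypoint part of this consistency term can be sketched as follows (a NumPy illustration of the scale-rotate-translate-project chain under our notation; the actual training code is not reproduced here):

```python
import numpy as np

def project(points_3d, K):
    """Pinhole projection of (N, 3) camera-frame points with 3x3 intrinsics K."""
    uvw = points_3d.dot(K.T)
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_loss(kp_cad, dims, R, C, K, kp_2d_gt, vis):
    """Consistency term: scale CAD-frame keypoints by the predicted
    dimensions D, rotate by R, translate by C, project, and penalize the
    deviation from ground-truth 2D keypoints (visible ones only)."""
    kp_cam = (kp_cad * dims).dot(R.T) + C   # CAD frame -> camera frame
    err = project(kp_cam, K) - kp_2d_gt
    ax = np.abs(err)
    sl1 = np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)  # smooth L1
    return (vis[:, None] * sl1).sum()
```

Because the term depends jointly on the dimension, orientation and localization outputs, its gradients couple the otherwise independent branches.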
4 Experiments
4.1 Experimental setup
Table 1: AP3D (%) for the car class on the KITTI validation set.

Method        |       IoU = 0.5        |       IoU = 0.7
              | Easy   Moderate  Hard  | Easy   Moderate  Hard
Mono3D        | 25.19  18.20     15.52 |  2.53   2.31     2.31
OFTNet        |   --     --        --  |  4.07   3.27     3.29
MonoGRNet     | 50.51  36.97     30.82 | 13.88  10.19     7.62
MF3D          | 47.88  29.48     26.44 | 10.53   5.69     5.39
Ours          | 48.81  30.17     20.07 | 11.91   6.64     4.28
Ours (+loss)  | 50.82  31.28     20.21 | 13.96   7.37     4.54
Dataset.
We train and evaluate our approach using the KITTI 3D object detection benchmark dataset. For the sake of comparison with state-of-the-art methods, we follow the setup presented in [2], which provides 3712 and 3769 images for training and validation, respectively, along with the camera calibration data. To extend the KITTI dataset with the assignment of geometric classes using CAD models and 2D keypoint coordinates, we employ the approach and data provided in [44]. Depending on the ratios between height, length and width, each car instance is assigned one of the 5 CAD model classes from the predefined set of CAD templates presented in Figure 2. We manually annotated each CAD model with the keypoint locations. Figure 3 displays an example of the annotated keypoints, most of which are a common choice [6] due to their interpretability, such as the edges of the car's frame; we also included the corners of the windshields to deal with the height of each instance. To obtain the 2D coordinates of the keypoints, we projected the CAD models from 3D space onto the image plane using the ground-truth location, dimension and rotation values. Simultaneous projection of all 3D CAD models in a scene provides us with a depth ordering mask, allowing us to define the visibility state of each keypoint.
Network architecture.
We utilize Mask R-CNN with a Feature Pyramid Network [19], based on a ResNet-101 [12], as our backbone network for multi-level feature extraction. Instead of the higher-resolution feature maps used by the keypoint head of the original architecture, we stack the same number of kernels, followed by a fully connected layer to predict the normalized 2D coordinates and visibility states for each of the 14 keypoints. From the same feature maps, we branch a fully connected layer predicting the local orientation in bins of 5° each, totaling 72 output units. The feature sharing between the keypoint and local orientation branches was found crucial for network performance, as both characteristics imply similar geometric reasoning. In parallel to 2D detection and 2D keypoint estimation, we create a subnetwork of a similar architecture for dimension regression and classification into geometric classes. The remaining components, including the RPN, RoIAlign, and the bounding box regression and classification heads, are implemented following the original Mask R-CNN design. For instance depth retrieval, we use only four pairs of keypoints: the corners of the front and rear windows. The other keypoints are used for additional supervision in the consistency loss calculation during training.
Training our model.
We set hyperparameters following the Mask R-CNN work [11]. The RPN anchor set covers five scales, adjusted to the values of 4, 8, 16, 32, 64, and three default aspect ratios. Each minibatch consists of 2 images, producing 256 regions of interest, with a positive-to-negative sample ratio set to 1:3 to achieve class sampling balance during training. Any geometric augmentations over the images are omitted; we solely apply image padding to meet the network architecture requirements. The ResNet-101 is initialized with weights pretrained on ImageNet [5] and frozen during further training steps. We first train the 2D detection and classification subnetwork for 100K iterations, adopting the Adam optimizer with a fixed learning rate, a weight decay of 0.001 and a momentum of 0.9. The 2D keypoint and local orientation branches are then trained for 50K iterations. Finally, enabling the multi-head consistency loss, the whole network is trained in an end-to-end fashion for 50K iterations. We combine the losses from all of the network outputs, weighting them equally.
Evaluation metrics.
We evaluate the network under the conventional KITTI benchmark protocol, which enables comparison across approaches. We focus solely on the car category. By default, the KITTI settings require evaluation in 3 regimes: easy, moderate and hard, depending on the difficulty of a potential detection. 3D bounding box detection performance is measured by the 3D Average Precision (AP3D), setting the Intersection over Union (IoU) threshold to 0.5 and 0.7.
4.2 Experimental results
3D object detection.
We compare our performance with 4 monocular 3D object detection methods: Mono3D [2], OFTNet [37], MonoGRNet [33] and MF3D [46], which reported their results on the same validation set for the car class. We borrow the average precision numbers from their published results. The results are reported in Table 1. The experiments show that our approach outperforms state-of-the-art methods on the easy subset by a small margin, while remaining the second best on the moderate subset. This observation aligns with our intuition that visible salient features such as keypoints are crucial to the success of 3D pose estimation. For the moderate and hard images, 2D keypoints are challenging to detect robustly due to the high occlusion level or the low resolution of the instances. We also measure the effect of the reprojection consistency loss on our network performance, observing a positive effect of our loss function.
3D bounding box and global orientation estimation.
We follow the experiment presented in [33], evaluating the quality of the estimated 3D bounding box sizes, as well as of the orientation in the camera coordinate system. The mean errors of our approach, along with those of [33, 2] borrowed from their work, are presented in Table 2.
Table 2: Mean errors of 3D bounding box size and global orientation estimation.

Method        | Height (m)  Width (m)  Length (m) | Orientation (rad)
Mono3D        |   0.172       0.103      0.504    |   0.558
MonoGRNet     |   0.084       0.084      0.412    |   0.251
Ours          |   0.115       0.107      0.516    |   0.215
Ours (+loss)  |   0.101       0.091      0.403    |   0.191
Though the sizes of the 3D bounding boxes do not differ severely among the approaches, owing to the estimation of offsets from the mean bounding box dimensions, the orientation estimation results differ significantly. Since we retrieve the global orientation via geometric reasoning, learning the local orientation from 2D image features, the network provides more accurate predictions, in contrast to obtaining the orientation from the regressed 3D bounding box corners.
Qualitative results.
We provide a qualitative illustration of the network performance in Figure 5, displaying six road scenes with distinct levels of difficulty. In typical cases, our approach produces accurate 3D bounding boxes for all instances, along with the global orientation and 3D location. Remarkably, truncated objects can also be successfully detected, given that only one pair of keypoints falls within the image. Some hard cases, i.e., (e) and (f), primarily consist of objects that are distant, highly occluded or even invisible in the image. We believe such failure cases to be a common limitation of monocular image processing methods.
5 Conclusions
In this work, we presented a novel deep learning-based framework for monocular 3D object detection, combining well-known detectors with geometric reasoning on keypoints. We proposed to estimate correspondences between the detected 2D keypoints and their 3D counterparts annotated on the surfaces of 3D CAD models to solve the object localization problem. The results of the experimental evaluation of our approach on subsets of the KITTI 3D object detection benchmark demonstrate that it outperforms competing state-of-the-art approaches when the target objects are clearly visible, leading us to hypothesize that dense depth estimation is redundant for 3D detection in some instances. We have demonstrated that our multi-task reprojection consistency loss significantly improves performance, in particular the orientation of detections.
Acknowledgement
E. Burnaev and A. Artemov were supported by the Russian Science Foundation under Grant 194104109.
References
 [1] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière, and T. Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image, 2017.
 [2] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In CVPR, 2016.
 [3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multiview 3d object detection network for autonomous driving. In IEEE CVPR, 2017.
 [4] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
 [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [6] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge, 2010.
 [7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
 [8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
 [9] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279, 2017.
 [10] P. Guerrero, Y. Kleiman, M. Ovsjanikov, and N. J. Mitra. PCPNet: Learning local shape properties from raw point clouds. Computer Graphics Forum, 37(2):75–85, 2018.
 [11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN, 2017.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015.
 [13] M. Hejrati and D. Ramanan. Analyzing 3d objects in cluttered images. In Advances in Neural Information Processing Systems, pages 593–601, 2012.
 [14] B.-S. Hua, M.-K. Tran, and S.-K. Yeung. Pointwise convolutional neural networks. In CVPR, 2018.
 [15] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander. Joint 3d proposal generation and object detection from view aggregation, 2017.
 [16] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. PointPillars: Fast encoders for object detection from point clouds, 2018.
 [17] Y. Li, R. Bu, M. Sun, and B. Chen. PointCNN. CoRR, abs/1801.07791, 2018.
 [18] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing ikea objects: Fine pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2992–2999, 2013.
 [19] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection, 2016.
 [20] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection, 2017.
 [21] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context, 2014.
 [22] Y.-L. Lin, V. I. Morariu, W. Hsu, and L. S. Davis. Jointly optimizing 3d model fitting and fine-grained classification. In European Conference on Computer Vision, pages 466–480. Springer, 2014.
 [23] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation, 2018.
 [24] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector, 2015.
 [25] P. Marion, P. R. Florence, L. Manuelli, and R. Tedrake. Label fusion: A pipeline for generating ground truth labels for real rgbd data of cluttered scenes. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
 [26] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649, 2017.
 [27] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 International Conference on 3D Vision (3DV), pages 506–516. IEEE, 2017.
 [28] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3d human pose estimation with a single RGB camera. ACM Transactions on Graphics, volume 36, July 2017.
 [29] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3d bounding box estimation using deep learning and geometry. In CVPR, 2017.
 [30] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from rgbd data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
 [31] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
 [32] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
 [33] Z. Qin, J. Wang, and Y. Lu. MonoGRNet: A geometric reasoning network for monocular 3d object localization. arXiv preprint arXiv:1811.10247, 2018.
 [34] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3d human pose from 2d image landmarks. In European Conference on Computer Vision, pages 573–586. Springer, 2012.
 [35] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection, 2015.
 [36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks, 2015.
 [37] T. Roddick, A. Kendall, and R. Cipolla. Orthographic feature transform for monocular 3d object detection. arXiv preprint arXiv:1811.08188, 2018.
 [38] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. In ACM transactions on graphics (TOG), volume 25, pages 835–846. ACM, 2006.
 [39] S. Suwajanakorn, N. Snavely, J. J. Tompson, and M. Norouzi. Discovery of latent 3d keypoints via endtoend geometric reasoning. In Advances in Neural Information Processing Systems, pages 2063–2074, 2018.
 [40] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems, pages 1799–1807, 2014.
 [41] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
 [42] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph CNN for learning on point clouds. CoRR, abs/1801.07829, 2018.
 [43] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3d interpreter network. In European Conference on Computer Vision, pages 365–382. Springer, 2016.
 [44] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Datadriven 3d voxel patterns for object category recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2015.
 [45] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
 [46] B. Xu and Z. Chen. Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2345–2353, 2018.
 [47] D. Xu, D. Anguelov, and A. Jain. PointFusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2018.
 [48] Y. Yan, Y. Mao, and B. Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
 [49] B. Yang, W. Luo, and R. Urtasun. PIXOR: Real-time 3d object detection from point clouds. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [50] M. Zeeshan Zia, M. Stark, and K. Schindler. Explicit occlusion modeling for 3d object class representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3326–3333, 2013.
 [51] Y. Zhou and O. Tuzel. VoxelNet: End-to-end learning for point cloud based 3d object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [52] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler. Revisiting 3d geometric models for accurate object shape and pose. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 569–576. IEEE, 2011.
 [53] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler. Detailed 3d representations for object modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2608–2623, 2013.
 [54] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European conference on computer vision, pages 391–405. Springer, 2014.