3D Pose Estimation and 3D Model Retrieval for Objects in the Wild
Abstract
We propose a scalable, efficient and accurate approach to retrieve 3D models for objects in the wild. Our contribution is twofold. We first present a 3D pose estimation approach for object categories which significantly outperforms the state-of-the-art on Pascal3D+. Second, we use the estimated pose as a prior to retrieve 3D models which accurately represent the geometry of objects in RGB images. For this purpose, we render depth images from 3D models under our predicted pose and match learned image descriptors of RGB images against those of rendered depth images using a CNN-based multi-view metric learning approach. In this way, we are the first to report quantitative results for 3D model retrieval on Pascal3D+, where our method chooses the same models as human annotators for 50% of the validation images on average. In addition, we show that our method, which was trained purely on Pascal3D+, retrieves rich and accurate 3D models from ShapeNet given RGB images of objects in the wild.
1 Introduction
Retrieving 3D models for objects in 2D images, as shown in Fig. 1, is extremely useful for 3D scene understanding, augmented reality applications and tasks like object grasping or object tracking. Recently, the emergence of large databases of 3D models such as ShapeNet [3] initiated substantial interest in this topic and motivated research for matching 2D images of objects against 3D models. However, there is no straightforward approach to compare 2D images and 3D models, since they have considerably different representations and characteristics.
One approach to address this problem is to project 3D models onto 2D images, which is known as rendering [24]. This converts the task to comparing 2D images, which is, however, still challenging, because the appearance of objects in real images and synthetic renderings can significantly differ. In general, the geometry and texture of available 3D models do not exactly match those of objects in real images. Therefore, recent approaches [2, 10, 23, 28] use convolutional neural networks (CNNs) [7, 8, 22] to extract features from images which are partly invariant to these variations. In particular, these methods compute image descriptors from real RGB images and synthetic RGB images which are generated by rendering 3D models under multiple poses. While this allows them to train a single CNN purely on synthetic data, there are two main disadvantages:
First, there is a significant domain gap between real and synthetic RGB images: real images are affected by complex lighting, uncontrolled degradation and natural backgrounds. This makes it hard to render photorealistic images from the available 3D models. Therefore, using a single CNN for feature extraction from both domains is limited in performance, and even domain adaptation [13] does not fully account for the different characteristics of real and synthetic images.
Second, processing renderings from multiple poses is computationally expensive. However, this step is mandatory, because the appearance of an object can significantly vary with the pose, and mapping images from all poses to a common descriptor does not scale to many categories [11].
To overcome these limitations, we propose to first predict the object pose and to then use this pose as an effective prior for 3D model retrieval. Inspired by recent works on instance pose estimation [4, 19], we present a robust 3D pose estimation approach for object categories based on virtual control points. More specifically, we use a CNN to predict the 2D projections of virtual 3D control points from which we recover the pose using a PnP algorithm. This approach not only outperforms the state-of-the-art for viewpoint estimation on Pascal3D+ [29], but also supports category-agnostic predictions. Having an estimate of the 3D pose makes our approach scalable, as it reduces the matching process to a single rendering per 3D model.
Additionally, we propose to render depth images instead of RGB images and to use different CNNs for feature extraction from the real and synthetic domain. Thus, we are not only able to deal with untextured models, but also to alleviate the domain gap. We implement our 3D model retrieval method using a multi-view metric learning approach, which is trained on real and synthetic data from Pascal3D+. In this way, we are the first to present quantitative results for 3D model retrieval on Pascal3D+. Moreover, we demonstrate that our approach retrieves rich and accurate 3D models from ShapeNet given unseen images from Pascal3D+. To summarize, we make the following contributions:

We present a 3D pose estimation approach for object categories which significantly outperforms the state-of-the-art on Pascal3D+. Our method predicts virtual control points which generalize across categories, making the approach scalable.

We introduce a 3D model retrieval approach which utilizes a pose prior. For this purpose, we match learned image descriptors of RGB images against those of depth images rendered from 3D models under our predicted pose. In this way, we retrieve 3D models from ShapeNet which accurately represent the geometry of objects in RGB images, as shown in Fig. 1.
2 Related Work
Since there is a vast amount of literature on both 3D pose estimation and 3D model retrieval, we focus our discussion on recent works which target these tasks for object categories in particular.
2.1 3D Pose Estimation
Many recent works only perform 3DoF viewpoint estimation and predict the object rotation using regression, classification or hybrid variants of the two. [28] directly regresses azimuth, elevation and in-plane rotation using a CNN. [12] compares different variants and presents a regression approach which parameterizes each angle using trigonometric functions. [25, 26] perform viewpoint classification by discretizing the range of each angle into a number of disjoint bins and predicting the most likely bin using a CNN. [24] uses a fine-grained, geometric-structure-aware classification, which encourages correlation between bins of nearby views. [15] formulates the task as a hybrid classification/regression problem: in addition to viewpoint classification, a residual rotation is regressed for each angular bin, and the 3D dimensions of the object are predicted. [14] uses a slightly different parameterization and predicts a 2D translation to refine the object localization in a coarse-to-fine hybrid approach.
However, predicting a full 6DoF pose instead of a 3DoF viewpoint is desirable for many applications. Therefore, numerous methods compute both rotation and translation from 2D-3D keypoint correspondences. [18] recovers the pose from keypoint predictions and CAD models using a PnP algorithm. [26] presents a keypoint prediction approach that combines local keypoint estimates with a global viewpoint estimate. [17] predicts semantic keypoints and trains a deformable shape model which takes keypoint uncertainties into account.
These approaches rely on category-specific keypoints which do not generalize across categories. In the context of 3D pose estimation for object instances, [4] therefore considers virtual control points and predicts their 2D projections to estimate the pose from object parts. [19] takes a similar approach, but uses the corners of the object's 3D bounding box as virtual control points. This work inspired our approach; however, it is not directly applicable to object category pose estimation, since the ground truth 3D model of an object must be known at runtime.
2.2 3D Model Retrieval
One intuitive approach to 3D model retrieval is to rely on classification. [14] performs fine-grained category recognition and provides a model for each category. [1] uses a linear classifier on mid-level representations of real images and renderings from multiple viewpoints to predict both shape and viewpoint.
However, retrieval via classification does not scale. Therefore, many recent methods take a metric learning approach. The most common strategy is to train a single CNN to extract features from real RGB images and RGB renderings. [2] uses a CNN pretrained on ImageNet [21] as a feature extractor and matches features of real images against those of 3D models rendered under multiple viewpoints to predict both shape and viewpoint. [10] takes a similar approach, but uses a different network architecture for feature extraction. [13] also employs a pretrained CNN, but additionally performs nonlinear feature adaptation to overcome the domain gap between real and rendered images.
[28] fine-tunes a pretrained CNN using lifted structure embedding [16] and averages the distance of a real image to renderings from multiple viewpoints to be more invariant to object pose. [23] presents a CNN architecture that combines information of renderings from multiple viewpoints into a single pose-invariant object descriptor. [11] explicitly constructs an embedding space using a 3D similarity measure evaluated on clean 3D models and trains a CNN to map renderings with arbitrary backgrounds to the corresponding points in the embedding space.
While it is convenient to use RGB images, it is unclear how to deal with untextured 3D models or how to set the scene lighting. Therefore, other methods perform 3D model retrieval using depth instead of RGB images. [5] uses an ensemble of autoencoders followed by a domain adaptation layer to match real depth images against depth images of 3D models. [31] computes image descriptors by fusing global autoencoder and local SIFT features of depth images. However, real depth images are not available in many scenarios.
Another approach which alleviates the domain gap and maps different representations to a common space is multi-view learning. [6] trains two different networks to map 3D voxel grids and RGB images to a low-dimensional embedding space, where 3D model retrieval is performed by matching embeddings of real RGB images against those of voxel grids. [30] also presents a multi-view approach using two networks, but maps LD-SIFT features extracted from 3D models and depth images to a common space. In contrast to these methods, we map real RGB images and rendered depth images to a common representation. In this way, we do not need to perform computationally expensive 3D convolutions for high-resolution voxel grids and do not rely on real depth images.
3 3D Pose Estimation and 3D Model Retrieval
Given an RGB image containing one or more objects, we want to retrieve 3D models with a geometry that corresponds well to the actual objects. Fig. 2 shows our proposed pipeline. We first estimate the 3D pose of an object from an image window roughly centered on the object. In this work, we assume the input image windows are known as in [29] or given by a 2D object detector [20]. Similar to previous works [15, 17, 26], we also assume the object category to be known, as it is a useful prior for both pose estimation and model retrieval. However, we also show that this information is not strictly required by our approach: when the category is unknown, we can still estimate an accurate pose with only a marginal loss of accuracy.
After estimating the object pose, we render a number of candidate 3D models under that pose. In particular, we render depth images, which allows us to deal with untextured 3D models and to circumvent the problem of scene lighting. In order to compare the real RGB image to synthetic depth renderings, we extract image descriptors using two CNNs, one for each domain. Finally, we match these image descriptors to retrieve the closest 3D model.
3.1 3D Pose Estimation
The first step in our model retrieval approach is to robustly compute the 3D pose of the objects of interest. For this purpose, inspired by [4, 19], we predict the 2D image locations of virtual control points. More precisely, we train a CNN to predict the 2D image locations of the projections of the object's eight 3D bounding box corners. The actual 3D pose is then computed by solving a perspective-n-point (PnP) problem, which recovers rotation and translation from 2D-3D correspondences. This is illustrated in the first row of Fig. 2.
However, PnP algorithms require the 3D coordinates of the virtual control points to be known. Therefore, previous approaches either assume the exact 3D model to be given at runtime [19] or predict the projections of static 3D points [4]. To overcome this limitation, we predict the spatial dimensions of the object's 3D bounding box and use these to scale a unit cube, which approximates the ground truth 3D coordinates.
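To make this step concrete, the following numpy sketch builds the 2D-3D correspondences from a unit cube scaled by predicted dimensions and recovers the pose. The paper does not specify which PnP solver is used; here we substitute a simple DLT-based solver with orthonormalization, and all function names are ours:

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D points X (N, 3) into the image with intrinsics K and pose (R, t)."""
    x = (K @ (R @ X.T + t[:, None])).T
    return x[:, :2] / x[:, 2:3]

def scaled_unit_cube(dims):
    """3D bounding box corners: a unit cube scaled by the predicted dimensions."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return 0.5 * signs * dims

def pose_from_projections(K, X, x):
    """Recover (R, t) from 2D-3D correspondences via DLT plus orthonormalization."""
    A = np.zeros((2 * len(X), 12))
    for i, (Xi, (u, v)) in enumerate(zip(X, x)):
        Xh = np.append(Xi, 1.0)
        A[2 * i, 0:4], A[2 * i, 8:12] = Xh, -u * Xh
        A[2 * i + 1, 4:8], A[2 * i + 1, 8:12] = Xh, -v * Xh
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)          # null vector of A
    M = np.linalg.inv(K) @ P
    if (M @ np.append(X.mean(axis=0), 1.0))[2] < 0:    # enforce positive depth
        M = -M
    U, S, Vt = np.linalg.svd(M[:, :3])
    if np.linalg.det(U @ Vt) < 0:                      # guard against reflections
        Vt[-1] *= -1.0
    return U @ Vt, M[:, 3] / S.mean()                  # nearest rotation, scaled translation
```

With the eight non-coplanar box corners, the DLT system is exactly determined up to scale, so clean correspondences recover the pose almost exactly; with noisy CNN predictions, a robust solver would be preferable.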
For this purpose, we introduce a CNN architecture which jointly predicts the 2D image locations of the projections of the eight 3D bounding box corners (16 values) as well as the 3D bounding box dimensions (3 values). As illustrated in Fig. 3, we implement this architecture as a single 19-neuron linear output layer, which we apply on top of the penultimate layer of different base networks such as VGG [22] or ResNet [7, 8]. During training, we optimize the pose loss
L_pose = L_proj + α · L_dim + β · L_reg,    (1)

which is a linear combination of the projection loss L_proj, the dimension loss L_dim and the regularization L_reg. The meta-parameters α and β control the impact of the different loss terms. Let C_i be the i-th 3D bounding box corner and c_i its projection using the ground truth rotation R and translation t; then the projection loss
L_proj = E[ (1/8) · Σ_{i=1}^{8} Huber(‖c_i − ĉ_i‖₂) ]    (2)
is the expected value of the distances between the ground truth projections and the predicted locations of these projections computed by the CNN for the training set. Being aware of inaccurate annotations in datasets such as Pascal3D+ [29], we use the Huber loss [9] in favor of the squared loss to be more robust to outliers.
The dimension loss
L_dim = E[ Huber(‖d − d̂‖₂) ]    (3)
is the expected value of the distances between the ground truth 3D dimensions and the 3D dimensions predicted by the CNN for the training set. To reduce the risk of overfitting, the regularization in Eq. (1) adds weight decay for all CNN parameters.
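The loss terms above can be written compactly; the following numpy sketch illustrates them. The relative weights α and β are meta-parameters, so the values below are placeholders, not the paper's settings:

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss of residual magnitudes r: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def pose_loss(proj_pred, proj_gt, dims_pred, dims_gt, params, alpha=1.0, beta=1e-4):
    """L_pose = L_proj + alpha * L_dim + beta * L_reg over a training batch.

    proj_* have shape (N, 8, 2): predicted/ground-truth corner projections.
    dims_* have shape (N, 3): predicted/ground-truth 3D box dimensions.
    params is a list of weight arrays for the weight-decay regularizer.
    """
    l_proj = huber(np.linalg.norm(proj_pred - proj_gt, axis=-1)).mean()
    l_dim = huber(np.linalg.norm(dims_pred - dims_gt, axis=-1)).mean()
    l_reg = sum(float((w ** 2).sum()) for w in params)   # weight decay
    return l_proj + alpha * l_dim + beta * l_reg
```

The Huber loss grows only linearly for large residuals, which is why it is preferred over the squared loss on noisily annotated data.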
3.2 3D Model Retrieval
Having a robust estimate of the object pose, we render 3D models under this pose instead of rendering them under multiple poses [2, 10, 13, 23]. This significantly reduces the computational complexity compared to methods which process multiple renderings for each 3D model and provides a useful prior for retrieval. In contrast to recent approaches [11, 13, 23, 28], we render depth images instead of RGB images. This allows us to deal with 3D models which do not have material or texture. Additionally, we circumvent the problem of how to set the scene lighting.
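A full renderer rasterizes the model's triangles, but the idea of producing a depth image under the predicted pose can be illustrated with a minimal vertex-splatting z-buffer in numpy (a simplification of real depth rendering; all names are ours):

```python
import numpy as np

def render_depth(vertices, K, R, t, height, width):
    """Minimal point-splat depth rendering of a model under the predicted pose.

    A real renderer rasterizes triangles; splatting vertices with a z-buffer
    is enough to illustrate the idea.
    """
    cam = (R @ vertices.T + t[:, None]).T            # points in camera coordinates
    uvw = (K @ cam.T).T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    depth = np.full((height, width), np.inf)
    for (u, v), z in zip(uv, cam[:, 2]):
        if 0 <= u < width and 0 <= v < height and z > 0:
            depth[v, u] = min(depth[v, u], z)        # z-buffer: keep nearest point
    return depth
```

Pixels not hit by any vertex stay at infinity and would be treated as background.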
Before rendering a 3D model, we rescale it to tightly fit into our predicted 3D bounding box. This is done by multiplying all vertices with the minimum of the ratio between the predicted 3D dimensions computed during pose estimation and the model’s actual 3D dimensions. In this way, we improve the alignment between input RGB images and rendered depth images.
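The rescaling step is a one-liner: scale all vertices by the minimum ratio between predicted and actual dimensions, so the model fits inside the predicted box along every axis. A sketch (function name is ours):

```python
import numpy as np

def rescale_to_box(vertices, pred_dims):
    """Uniformly rescale mesh vertices to tightly fit the predicted 3D bounding box."""
    model_dims = vertices.max(axis=0) - vertices.min(axis=0)
    s = np.min(pred_dims / model_dims)   # minimum ratio keeps the model inside the box
    return vertices * s
```

Using the minimum (rather than per-axis) ratio preserves the model's aspect ratio while guaranteeing containment.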
However, since RGB images and depth images have considerably different characteristics, we introduce a multi-view metric learning approach, which maps images from both domains to a common representation. We implement this mapping using a separate CNN for each domain. For real RGB images, we extract image descriptors from the hidden feature activations of the penultimate layer of our pose estimation CNN (see Fig. 3). As these activations have already been computed during pose estimation inference, we get the real image descriptor without any additional computational cost. For the synthetic depth images, we extract image descriptors using a CNN with the same architecture as our pose estimation CNN, except for the output layer (see Fig. 3).
To finally map images from both domains to a common representation, we optimize the similarity loss
L_sim = L_desc + γ · L_reg    (4)
which comprises the image descriptor loss L_desc and the regularization L_reg weighted by the meta-parameter γ.
The image descriptor loss
L_desc = E[ max(0, s_pos − s_neg + m) ]    (5)
minimizes the expected value of the Triplet loss [27] for the training set. Here, s_pos is the Euclidean distance between the real RGB image descriptor and the corresponding synthetic depth image descriptor, s_neg is the Euclidean distance between the real RGB image descriptor and a negative example synthetic depth image descriptor, and m specifies the margin, i.e., the desired minimum difference between s_pos and s_neg. To reduce the risk of overfitting, the regularization in Eq. (4) adds weight decay for all CNN parameters.
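The Triplet loss of Eq. (5) can be sketched directly on descriptor batches (a numpy illustration; the margin value is a placeholder, not the paper's setting):

```python
import numpy as np

def triplet_loss(f_rgb, f_pos, f_neg, margin=1.0):
    """Triplet loss max(0, s_pos - s_neg + m) on descriptor batches of shape (N, D)."""
    s_pos = np.linalg.norm(f_rgb - f_pos, axis=-1)   # distance to matching depth descriptor
    s_neg = np.linalg.norm(f_rgb - f_neg, axis=-1)   # distance to a negative example
    return np.maximum(0.0, s_pos - s_neg + margin).mean()
```

The loss is zero once every negative is at least a margin farther from the anchor than its positive, so training pushes matching RGB/depth descriptors together only as far as needed.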
After the optimization of the CNNs, we can precompute descriptors for synthetic depth images. In this case, we generate multiple renderings for each 3D model, which cover the full pose space. We then compute descriptors for all these renderings and store them in a database. At runtime, we just match descriptors from the viewpoint closest to our predicted pose, which is fast and scalable, but still accurate as shown in our experiments.
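The offline/online split can be sketched as follows. For brevity, poses are reduced to a single azimuth angle without wrap-around, and `embed` stands in for rendering a model under a viewpoint and running the depth CNN; all names are ours:

```python
import numpy as np

def build_database(models, azimuths, embed):
    """Offline: precompute a descriptor for every (model, discretized azimuth) rendering."""
    return {(m, a): embed(m, a) for m in models for a in azimuths}

def retrieve(query_desc, pred_azimuth, db, azimuths):
    """Online: match only descriptors at the discretized pose closest to the prediction."""
    nearest = min(azimuths, key=lambda a: abs(a - pred_azimuth))  # no wrap-around, for brevity
    candidates = {m: d for (m, a), d in db.items() if a == nearest}
    return min(candidates, key=lambda m: np.linalg.norm(candidates[m] - query_desc))
```

Because only one viewpoint bin is searched per query, the online cost grows with the number of models but not with the pose discretization.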
4 Experimental Results
To demonstrate our 3D model retrieval approach for objects in the wild, we evaluate it in a realistic setup where we retrieve 3D models from ShapeNet [3] given unseen RGB images from Pascal3D+ [29]. In particular, we train our 3D model retrieval approach purely on data from Pascal3D+, but use it to retrieve 3D models from ShapeNet. The corresponding results are detailed in Sec. 4.2. As estimating an accurate object pose is essential for our retrieval approach, we additionally evaluate our pose estimation approach on Pascal3D+ in Sec. 4.1.
4.1 3D Pose Estimation
In the following, we first give a detailed evaluation of our pose estimation approach. Then, we compare it to previous methods, outperforming the state-of-the-art for viewpoint estimation on Pascal3D+. Finally, we demonstrate that in some cases we are even able to top the state-of-the-art without providing the correct category prior. For a fair evaluation, we follow the evaluation protocol of [26], which quantifies 3DoF viewpoint prediction accuracy on Pascal3D+ using the geodesic distance
Δ(R_gt, R_pred) = ‖log(R_gt^T R_pred)‖_F / √2    (6)

to measure the difference between the ground truth viewpoint rotation matrix R_gt and the predicted viewpoint rotation matrix R_pred. In particular, we report two metrics: MedErr (the median of all viewpoint differences) and Acc_{π/6} (the percentage of all viewpoint differences smaller than π/6). Evaluating our approach using the AVP metric [29], which couples 2D object detection and azimuth classification, is not meaningful as it is very different from our specific task.
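The geodesic distance of Eq. (6) equals the relative rotation angle, which permits a compact implementation of the metric and the two derived scores (a numpy sketch; function names are ours):

```python
import numpy as np

def geodesic_distance(R_gt, R_pred):
    """||log(R_gt^T R_pred)||_F / sqrt(2), i.e. the relative rotation angle in radians."""
    cos = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))   # clip guards against round-off

def med_err(deltas):
    """Median viewpoint difference (MedErr)."""
    return float(np.median(deltas))

def acc_pi6(deltas):
    """Fraction of viewpoint differences below pi/6 (Acc_{pi/6})."""
    return float(np.mean(np.asarray(deltas) < np.pi / 6))
```

The trace identity avoids computing a matrix logarithm: for a rotation by angle θ, trace(R) = 1 + 2 cos θ.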
4.1.1 3D Pose Estimation on Pascal3D+
Table 1 presents quantitative results for 3DoF viewpoint estimation on Pascal3D+ using our approach in different setups, starting from a baseline using VGG and moving to a more elaborate version building on ResNet. Specific implementation details and other parameters are provided in the supplementary material. For our baseline approach (Ours - VGG), we build on VGG and finetune the entire network for our task, similar to [15, 25, 26]. As can be seen from Table 2, this baseline already matches the state-of-the-art.
Method  MedErr  Acc_{π/6}
Ours - VGG  11.7  0.8076
Ours - VGG+blur  11.6  0.8033
Ours - ResNet  10.9  0.8341
Ours - ResNet+blur  10.9  0.8392
When inspecting the failure cases, we see that many of them relate to small objects. In these cases, object image windows need to be upscaled to fit the fixed spatial input resolution of pretrained CNNs. This results in blurry images, and VGG, which only employs 3×3 convolutions, performs poorly at extracting features from such over-smoothed images.
Therefore, we propose to use a network with larger kernel sizes that handles over-smoothed input images better, such as ResNet50 [7, 8], which uses 7×7 kernels in the first convolutional layer. As presented in Table 1, our approach with a ResNet backend (Ours - ResNet) significantly outperforms the VGG-based version. In addition, the total number of network parameters is notably lower (VGG: 135M vs. ResNet: 24M).
To further improve the performance, we employ data augmentation in the form of image blurring. Using ResNet as a base network together with blurring training images (Ours - ResNet+blur), we improve on the Acc_{π/6} metric while maintaining a low MedErr (see Table 1). This indicates that we improve the performance on over-smoothed images, but do not lose accuracy on sharp images. While our approach with a ResNet backend shows increased performance in this setup, we do not benefit from training on blurred images using a VGG backend (Ours - VGG+blur). This also confirms that VGG is not suited for feature extraction from over-smoothed images. For all following experiments, we use our best performing setup, i.e., Ours - ResNet+blur.
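The blur augmentation can be sketched with a separable Gaussian filter in numpy (an illustration; the paper does not specify the kernel or σ, so these are assumptions):

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel with radius 3*sigma."""
    x = np.arange(-int(3 * sigma), int(3 * sigma) + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def blur_augment(image, sigma):
    """Separable Gaussian blur simulating upscaled low-resolution training crops."""
    k = gaussian_kernel(sigma)
    # convolve rows, then columns (separability of the Gaussian)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, out)
```

Applying this with a randomly sampled σ at training time exposes the network to the over-smoothed statistics of upscaled small objects.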
category-specific

MedErr  aero  bike  boat  bottle  bus  car  chair  table  mbike  sofa  train  tv  mean
[17]  11.2  15.2  37.9  13.1  4.7  6.9  12.7  N/A  N/A  21.7  9.1  38.5  N/A
[17]*  8.0  13.4  40.7  11.7  2.0  5.5  10.4  N/A  N/A  9.6  8.3  32.9  N/A
[26]  13.8  17.7  21.3  12.9  5.8  9.1  14.8  15.2  14.7  13.7  8.7  15.4  13.6
[15]  13.6  12.5  22.8  8.3  3.1  5.8  11.9  12.5  12.3  12.8  6.3  11.9  11.1
[24]**  15.4  14.8  25.6  9.3  3.6  6.0  9.7  10.8  16.7  9.5  6.1  12.6  11.7
Ours  10.0  15.6  19.1  8.6  3.3  5.1  13.7  11.8  12.2  13.5  6.7  11.0  10.9

Acc_{π/6}  aero  bike  boat  bottle  bus  car  chair  table  mbike  sofa  train  tv  mean
[26]  0.81  0.77  0.59  0.93  0.98  0.89  0.80  0.62  0.88  0.82  0.80  0.80  0.8075
[15]  0.78  0.83  0.57  0.93  0.94  0.90  0.80  0.68  0.86  0.82  0.82  0.85  0.8103
[24]**  0.74  0.83  0.52  0.91  0.91  0.88  0.86  0.73  0.78  0.90  0.86  0.92  0.8200
Ours  0.83  0.82  0.64  0.95  0.97  0.94  0.80  0.71  0.88  0.87  0.80  0.86  0.8392

category-agnostic

Metric  aero  bike  boat  bottle  bus  car  chair  table  mbike  sofa  train  tv  mean
MedErr (Ours)  10.9  12.2  23.4  9.3  3.4  5.2  15.9  16.2  12.2  11.6  6.3  11.2  11.5
Acc_{π/6} (Ours)  0.80  0.82  0.57  0.90  0.97  0.94  0.72  0.67  0.90  0.80  0.82  0.85  0.8133
4.1.2 Comparison to the StateoftheArt
Next, we compare our pose estimation approach to state-of-the-art methods on Pascal3D+. Quantitative results are presented in Table 2. Our approach significantly outperforms the state-of-the-art in both MedErr and Acc_{π/6} in terms of mean performance across all categories, and also shows competitive results for individual categories.
However, the scores for two categories, boat and table, are significantly below the mean, so we analyze these results in more detail. The category boat is the most challenging due to the large intra-class variability in shape and appearance. Many detections for this category are of low resolution, and objects are often barely visible because of fog or mist. Additionally, there are many ambiguities, e.g., even a human cannot distinguish between the front and the back of an unmanned canoe. Nevertheless, we outperform the state-of-the-art for this challenging category.
The low scores for the category table can be explained by three factors. First, many tables are partly occluded by chairs (see table in Fig. 4). Second, the evaluation protocol does not take into account that many tables are ambiguous with respect to an azimuth rotation of 180°, or even have an axis of symmetry, e.g., a round table. In some cases, our system predicts an ambiguous pose instead of the ground truth pose, even though it is not possible to differentiate between the two poses. The evaluation protocol needs to be changed to take this into account. Last, the number of validation samples is very small and, therefore, the reported results for this category are highly biased.
4.1.3 CategoryAgnostic Pose Estimation
So far, the discussed results are category-specific, which means that the ground truth category must be known at runtime. In fact, all methods use a separate output layer for each category. However, our approach is able to make category-agnostic predictions which generalize across different categories. In this case, we use a single 19-neuron output layer which is shared across all categories, making our approach scalable. Our category-agnostic pose estimation even outperforms the previous category-specific state-of-the-art for some categories, because it fully leverages the mutual information between similar categories, like bike and mbike, as shown in Table 2.
Method  aero  bike  boat  bottle  bus  car  chair  table  mbike  sofa  train  tv  mean
Top1Acc (Rand)  0.15  0.21  0.36  0.25  0.25  0.10  0.15  0.10  0.28  0.31  0.27  0.27  0.2250
Top1Acc (Cano)  0.12  0.25  0.38  0.35  0.45  0.21  0.20  0.15  0.20  0.21  0.49  0.50  0.2925
Top1Acc (Off)  0.48  0.33  0.58  0.41  0.75  0.35  0.28  0.10  0.44  0.28  0.62  0.63  0.4375
Top1Acc (Pred)  0.48  0.31  0.60  0.41  0.78  0.41  0.29  0.19  0.43  0.36  0.65  0.61  0.4600
Top1Acc (GT)  0.53  0.38  0.51  0.37  0.79  0.44  0.32  0.43  0.48  0.33  0.66  0.72  0.4967
4.2 3D Model Retrieval
Now we demonstrate our 3D model retrieval approach using our predicted pose. First, we present a quantitative evaluation of our approach on Pascal3D+. Second, we show qualitative results for 3D model retrieval from ShapeNet given images from Pascal3D+. Finally, we use our predicted 6DoF pose and 3DoF dimensions to precisely align retrieved 3D models with objects in real world images.
4.2.1 3D Model Retrieval from Pascal3D+
Since Pascal3D+ provides correspondences between RGB images and 3D models as well as pose annotations, we can train our approach purely on this dataset. In fact, we are the first to report quantitative results for 3D model retrieval on this dataset. For this purpose, we compute the top-1 accuracy (Top1Acc), i.e., the percentage of evaluated samples for which the top retrieved model equals the ground truth model. This task is not trivial, because many models in Pascal3D+ have similar geometry and are hard to distinguish. Thus, we evaluate our approach using five different pose setups, i.e., the ground truth pose (GT), our predicted pose (Pred), our predicted pose with offline precomputed descriptors (Off), a canonical pose (Cano) and a random pose (Rand). Table 3 shows quantitative retrieval results.
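The Top1Acc metric reduces to an exact-match rate over retrieved model identifiers; a minimal sketch (function name is ours):

```python
import numpy as np

def top1_accuracy(retrieved, ground_truth):
    """Fraction of samples whose top retrieved model equals the annotated model."""
    return float(np.mean(np.asarray(retrieved) == np.asarray(ground_truth)))
```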
As expected, we achieve the highest accuracy assuming the ground truth pose to be known (GT). In this case, our approach chooses the same 3D models as human annotators for 50% of the validation images on average. However, if we render the 3D models under our predicted pose (Pred), we almost match the accuracy of the ground truth pose setup. For some categories, we observe even better accuracy when using our predicted pose. This confirms the high quality of our predicted poses. Moreover, our approach is fast and scalable at runtime while almost maintaining accuracy by using offline precomputed descriptors (Off). For this experiment, we discretize the pose space into regular intervals and precompute descriptors for the 3D models. At runtime, we only match precomputed descriptors from the discretized pose which is closest to our predicted pose and do not have to render 3D models online. If we, in contrast, just render the 3D models under a random pose (Rand), the performance decreases significantly. Rendering models under a frontal view (Cano), on the other hand, provides a useful bias for the categories train, bus and tv monitor, which are frequently seen from an almost frontal view in this dataset. These results confirm the importance of fine pose estimation in our approach.
4.2.2 3D Model Retrieval from ShapeNet
In contrast to Pascal3D+, ShapeNet provides a significantly larger spectrum of 3D models. Thus, we now evaluate our retrieval approach trained purely on Pascal3D+ for 3D model retrieval from ShapeNet given previously unseen images from Pascal3D+. Fig. 4 shows qualitative retrieval results for all twelve categories. Our approach predicts accurate 3D poses and 3D models for objects of different categories. In some cases, our predicted pose (see sofa in Fig. 4) or our retrieved model from ShapeNet (see aeroplane and chair in Fig. 4) is even more accurate than the annotated ground truth from Pascal3D+. While the geometry of the retrieved models corresponds well to the objects in the query images, the materials and textures typically do not. The reason for this is that we use depth images for retrieval, which do not include color information. This issue can be addressed by extracting texture information from the query RGB image or by performing retrieval with RGBD images; however, we leave this to future research. Fig. 5 shows failure cases of our approach. If the pose estimation fails, the model retrieval becomes even more difficult. This is also reflected in Table 3, where we observe a strong decrease in performance when we render models without pose information (Rand and Cano). Also, if there is too much clutter in the query image, we cannot retrieve an accurate 3D model.
4.2.3 3D Model Alignment
Finally, we use our predicted 6DoF pose and 3DoF dimensions to precisely align retrieved 3D models with objects in real world images. Fig. 6 shows how we improve the 2D object localization and the alignment between the object and a rendering using our predicted pose and dimensions. This is especially useful if the object detection is not fully accurate, which is true in almost all situations. In this case, the detected image windows are a bit too small and the objects are not centered in the image windows. Thus, if we just render a model under our predicted rotation, rescale it to tightly fit into the 2D image window, and center it in the 2D image window, the alignment is poor. However, if we additionally use our predicted translation and 3D dimensions for scaling and positioning, we significantly improve the alignment between the object and the rendering. This is of tremendous importance for robotics or augmented reality applications.
5 Conclusion
3D object retrieval from RGB images in the wild is an important but challenging task. Existing approaches address this problem by training on vast amounts of synthetic data. However, there is a significant domain gap between real and synthetic images which limits performance. For this reason, we learn to map real RGB images and synthetic depth images to a common representation. Additionally, we show that estimating the object pose is a useful prior for 3D model retrieval. Our approach is scalable, as it supports category-agnostic predictions and offline precomputed descriptors. We not only outperform the state-of-the-art for viewpoint estimation on Pascal3D+, but also retrieve accurate 3D models from ShapeNet given unseen RGB images from Pascal3D+. Finally, these results motivate future research on jointly learning from real and synthetic data.
Acknowledgement
This work was funded by the Christian Doppler Laboratory for Semantic 3D Computer Vision.
References
 [1] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models. In Conference on Computer Vision and Pattern Recognition, pages 3762–3769, 2014.
 [2] M. Aubry and B. Russell. Understanding Deep Features with ComputerGenerated Imagery. In Conference on Computer Vision and Pattern Recognition, pages 2875–2883, 2015.
 [3] A. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical report, Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
 [4] A. Crivellaro, M. Rad, Y. Verdie, K. Moo Yi, P. Fua, and V. Lepetit. A Novel Representation of Parts for Accurate 3D Object Detection and Tracking in Monocular Images. In International Conference on Computer Vision, pages 4391–4399, 2015.
 [5] J. Feng, Y. Wang, and S.-F. Chang. 3D Shape Retrieval Using a Single Depth Image from Low-Cost Sensors. In IEEE Winter Conference on Applications of Computer Vision, pages 1–9, 2016.
 [6] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a Predictable and Generative Vector Representation for Objects. In European Conference on Computer Vision, pages 484–499, 2016.
 [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [8] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. In European Conference on Computer Vision, pages 630–645, 2016.
 [9] P. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
 [10] H. Izadinia, Q. Shan, and S. Seitz. IM2CAD. In Conference on Computer Vision and Pattern Recognition, 2017.
 [11] Y. Li, H. Su, C. R. Qi, N. Fish, D. Cohen-Or, and L. Guibas. Joint Embeddings of Shapes and Images via CNN Image Purification. ACM Transactions on Graphics, 34(6):234, 2015.
 [12] F. Massa, R. Marlet, and M. Aubry. Crafting a Multi-Task CNN for Viewpoint Estimation. In British Machine Vision Conference, pages 91.1–91.12, 2016.
 [13] F. Massa, B. Russell, and M. Aubry. Deep Exemplar 2D-3D Detection by Adapting from Real to Rendered Views. In Conference on Computer Vision and Pattern Recognition, pages 6024–6033, 2016.
 [14] R. Mottaghi, Y. Xiang, and S. Savarese. A Coarse-to-Fine Model for 3D Pose Estimation and Sub-Category Recognition. In Conference on Computer Vision and Pattern Recognition, pages 418–426, 2015.
 [15] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Conference on Computer Vision and Pattern Recognition, pages 7074–7082, 2017.
 [16] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep Metric Learning via Lifted Structured Feature Embedding. In Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
 [17] G. Pavlakos, X. Zhou, A. Chan, K. Derpanis, and K. Daniilidis. 6DoF Object Pose from Semantic Keypoints. In International Conference on Robotics and Automation, pages 2011–2018, 2017.
 [18] B. Pepik, M. Stark, P. Gehler, T. Ritschel, and B. Schiele. 3D Object Class Detection in the Wild. In Conference on Computer Vision and Pattern Recognition Workshops, pages 1–10, 2015.
 [19] M. Rad and V. Lepetit. BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects Without Using Depth. In International Conference on Computer Vision, pages 3828–3836, 2017.
 [20] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
 [21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [22] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
 [23] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-View Convolutional Neural Networks for 3D Shape Recognition. In Conference on Computer Vision and Pattern Recognition, pages 945–953, 2015.
 [24] H. Su, C. Qi, Y. Li, and L. Guibas. Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views. In International Conference on Computer Vision, pages 2686–2694, 2015.
 [25] S. Tulsiani, J. Carreira, and J. Malik. Pose Induction for Novel Object Categories. In International Conference on Computer Vision, pages 64–72, 2015.
 [26] S. Tulsiani and J. Malik. Viewpoints and Keypoints. In Conference on Computer Vision and Pattern Recognition, pages 1510–1519, 2015.
 [27] K. Weinberger and L. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244, 2009.
 [28] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. ObjectNet3D: A Large Scale Database for 3D Object Recognition. In European Conference on Computer Vision, pages 160–176, 2016.
 [29] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond Pascal: A Benchmark for 3D Object Detection in the Wild. In IEEE Winter Conference on Applications of Computer Vision, pages 75–82, 2014.
 [30] J. Zhu, F. Zhu, E. Wong, and Y. Fang. Learning Pairwise Neural Network Encoder for Depth Image-Based 3D Model Retrieval. In ACM International Conference on Multimedia, pages 1227–1230, 2015.
 [31] J. Zhu, F. Zhu, E. Wong, and Y. Fang. Deep Learning Representation Using Autoencoder for 3D Shape Retrieval. Neurocomputing, 204:41–50, 2016.
3D Pose Estimation and 3D Model Retrieval for Objects in the Wild
Supplementary Material

In the following, we provide additional qualitative results for our 3D model retrieval approach in Sec. A, which complement those presented in the paper. Furthermore, we analyze failure cases for both 3D model retrieval and the underlying 3D pose estimation in Sec. B. Finally, in Sec. C we discuss implementation details, parameter choices, and other relevant settings.
Appendix A 3D Model Retrieval
Fig. 7 shows additional qualitative results for 3D model retrieval from ShapeNet [3] given previously unseen images from Pascal3D+ [29] validation data for all twelve categories. Our approach predicts accurate 3D poses and 3D models for objects of different categories.
Fig. 8 presents further 3D model alignment results for object detections which are not fully accurate. Compared to using only a 3DoF viewpoint, exploiting our predicted 6DoF pose and 3DoF dimensions significantly improves the alignment between the object in the image and an RGB rendering of our retrieved 3D model.
Appendix B Failure Modes
Most failure cases of our 3D pose estimation on Pascal3D+ relate to lowresolution or ambiguous objects.
Fig. 9 shows 3D pose estimation results on low-resolution image windows from Pascal3D+ validation data. After rescaling, the over-smoothed input RGB images lack details and sharp discontinuities, which results in incorrect pose predictions. In fact, even for a human it is difficult to identify the correct object poses in these examples.
Fig. 10 shows additional failure cases: heavy occlusions, bad illumination conditions and difficult object poses, which are far from the poses seen during training, result in incorrect pose predictions.
As shown in Fig. 11, some objects from Pascal3D+ are symmetric, which makes their poses not well defined. For example, it is impossible to differentiate between the front and back of a symmetric unmanned boat. This issue is even more apparent for tables: Many tables are ambiguous with respect to azimuth rotations, or even have an axis of symmetry, such as a round table. When our approach predicts one of the possible poses that is not the annotated ground truth pose, this is considered a mistake by the commonly used evaluation protocol [26].
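To illustrate the ambiguity, a symmetry-aware variant of the rotation error could be computed as below. This is a minimal sketch, not the protocol of [26]: it assumes the object's equivalent poses are azimuth rotations by a known set of angles (e.g., 180 degrees for a symmetric boat) and takes the minimum geodesic rotation error over them.

```python
import numpy as np

def rotation_error_deg(R1, R2):
    # Geodesic distance on SO(3), in degrees.
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rot_z(deg):
    # Rotation about the vertical (azimuth) axis.
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def symmetry_aware_error(R_pred, R_gt, symmetry_angles_deg=(0.0, 180.0)):
    # Minimum error over all poses equivalent under the object's
    # azimuth symmetry (e.g., a table flipped by 180 degrees).
    return min(rotation_error_deg(R_pred, R_gt @ rot_z(a))
               for a in symmetry_angles_deg)
```

Under this metric, predicting the mirrored pose of a 180-degree-symmetric object would yield zero error instead of the maximal 180-degree error of the standard protocol.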
Fig. 12 shows that visual distortions due to wide-angle lenses (i.e., fisheye effects), deformed or demolished objects, and heavy occlusions can disturb the model retrieval step, even if the pose estimation was successful.
Appendix C Implementation Details
In the following, we provide implementation details and other parameters used in our work:
Intrinsic camera parameters: In Pascal3D+, the ground truth poses were computed from 2D-3D correspondences assuming the same intrinsic parameters for all images. We employ the same parameters in our approach.
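As a sketch of how shared intrinsics enter the pipeline, the following shows a standard pinhole projection under a single fixed calibration matrix; the focal length below is purely illustrative and not the actual value used for Pascal3D+.

```python
import numpy as np

def project(points_3d, K, R, t):
    # Pinhole projection: x = K (R X + t), then dehomogenize.
    X_cam = points_3d @ R.T + t   # (N, 3) points in the camera frame
    x = X_cam @ K.T               # (N, 3) homogeneous image points
    return x[:, :2] / x[:, 2:3]

# Hypothetical intrinsics shared across all images; Pascal3D+ assumes a
# single fixed camera, but these numbers are illustrative only.
K = np.array([[3000.0,    0.0, 0.0],
              [   0.0, 3000.0, 0.0],
              [   0.0,    0.0, 1.0]])
```

Because every image uses the same K, a pose regressed in one image transfers consistently to rendered views of a 3D model.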
Data augmentation: Like others [15, 17, 24, 26], we perform data augmentation by jittering ground truth detections, and we exclude detections marked as occluded or truncated from the evaluation. Additionally, we augment samples for which the longer edge of the ground truth image window is greater than 224 pixels by applying Gaussian blurring with varying kernel sizes and standard deviations. We randomly sample negative example 3D models from the available data. All augmentation parameters are randomized after each training epoch.
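A minimal sketch of this augmentation scheme follows; the jitter magnitude is an assumption, and the exact blur kernel sizes and standard deviations are not reproduced here.

```python
import random

def jitter_box(box, jitter_frac=0.1, rng=random):
    # Randomly shift and scale a ground-truth detection (x1, y1, x2, y2).
    # jitter_frac is a hypothetical magnitude, not the paper's value.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx = rng.uniform(-jitter_frac, jitter_frac) * w
    dy = rng.uniform(-jitter_frac, jitter_frac) * h
    ds = 1.0 + rng.uniform(-jitter_frac, jitter_frac)
    cx, cy = (x1 + x2) / 2 + dx, (y1 + y2) / 2 + dy
    nw, nh = w * ds, h * ds
    return (cx - nw / 2, cy - nh / 2, cx + nw / 2, cy + nh / 2)

def should_blur(box, threshold_px=224):
    # Gaussian blurring is only applied when the longer edge of the
    # ground-truth window exceeds the network input size.
    x1, y1, x2, y2 = box
    return max(x2 - x1, y2 - y1) > threshold_px
```

Re-sampling these parameters each epoch means the network never sees the exact same crop twice.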
Meta parameters: We normalize the projections so that the image pixel range is mapped to the interval [0,1] and use the same Huber loss for all 19 estimated values. The remaining meta parameters were chosen experimentally.
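The Huber loss [9] applied to the 19 regression targets has the standard form below; delta = 1.0 is a common default, not necessarily the value used in our experiments.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    # Huber loss [9]: quadratic for |r| <= delta, linear beyond,
    # which reduces the influence of outliers compared to pure L2.
    r = np.abs(residual)
    return np.where(r <= delta,
                    0.5 * r ** 2,
                    delta * (r - 0.5 * delta))
```

Normalizing the projected coordinates to [0,1] keeps all 19 residuals on a comparable scale, so a single delta works for every target.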
Network parameters: We use a batch size of 50, train our networks for 100 epochs, decrease the initial learning rate by one order of magnitude after 50 and 90 epochs, and employ the Adam optimization algorithm.
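The step schedule can be sketched as a small helper; base_lr is illustrative, since the initial learning rate value is not reproduced here.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(50, 90), gamma=0.1):
    # Step schedule: drop the learning rate by one order of magnitude
    # after each milestone epoch (50 and 90).
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```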
3D dimensions: For both Pascal3D+ and ShapeNet, 3D models are normalized to fit within a unit cube centered at the origin. Thus, we estimate 3D dimensions in model space in the range [0,1]. Since these dimensions tend to be consistent within a category, estimating them is not a major challenge. Table 4 shows quantitative results for 3D dimension estimation; we achieve high accuracy across all categories.
                         x      y      z
Median Absolute Error  0.022  0.015  0.014
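The normalization and dimension computation described above can be sketched as follows, assuming a simple vertex-array representation of the 3D model.

```python
import numpy as np

def normalize_to_unit_cube(vertices):
    # Scale and center a 3D model so it fits in a unit cube
    # centered at the origin.
    mins, maxs = vertices.min(axis=0), vertices.max(axis=0)
    center = (mins + maxs) / 2.0
    scale = (maxs - mins).max()
    return (vertices - center) / scale

def model_dimensions(vertices):
    # 3D dimensions in normalized model space, each in [0, 1];
    # the longest axis is exactly 1 by construction.
    v = normalize_to_unit_cube(vertices)
    return v.max(axis=0) - v.min(axis=0)
```

Because the longest axis is always 1 after normalization, only the two remaining dimensions carry category-specific shape information, which explains why estimating them is comparatively easy.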