Optimizable Object Reconstruction from a Single View

Kejie Li, Ravi Garg, Ming Cai, Ian Reid
The University of Adelaide

3D shape reconstruction from a single image is a highly ill-posed problem. A number of current deep learning based systems aim to solve the shape reconstruction and shape pose problems by learning an end-to-end network to perform feed-forward inference. More traditional (non-deep learning) methods cast the problem in an iterative optimization framework. In this paper, inspired by these more traditional shape-prior-based approaches, which separate the 2D recognition and 3D reconstruction, we develop a system that leverages the power of both feed-forward and iterative approaches. Our framework uses the power of deep learning to capture 3D shape information from training data and provide high quality initialization, while allowing both image evidence and shape priors to influence iterative refinement at inference time. Specifically, we employ an auto-encoder to learn a latent space of object shapes, a CNN that maps an image to the latent space, another CNN to predict 2D keypoints to recover object pose using PnP, and a segmentation network to predict an object’s silhouette from an RGB image. At inference time these components provide high quality initial estimates of the shape and pose, which are then further optimized based on the silhouette-shape constraint and a probabilistic shape prior learned on the latent space (see Figure 1). Our experiments show that this optimizable inference framework achieves state-of-the-art results on a large benchmarking dataset with real images.

1 Introduction

Humans effortlessly infer the 3D structure of objects from a single image, including the parts of the object which are not directly visible in the image. It is obvious that the human perception of object geometry involves, in large part, the vast experience of having seen the objects of a category from multiple views, having reasoned about their geometry and having built rich prior knowledge about meaningful object structure for a specific object category. However, single-view object shape reconstruction remains a challenging problem for modern vision algorithms – perhaps because of the difficulty of capturing and representing this prior shape information – and the solutions to 3D geometric perception have traditionally been dominated by multi-view reconstruction pipelines.

Figure 1: Given a single image, a deep neural net predicts the silhouette of the object, and also initializes the object pose and latent code of the shape. Both shape and pose are then optimized to fit to the predicted silhouette while the shape remains plausible according to the learned shape prior.

The recent successes of deep learning have motivated researchers to depart from traditional multi-view reconstruction methods and directly learn the image to object-shape mapping using convolutional neural networks [20, 10, 17, 48, 47, 53, 51]. These feed-forward networks show impressive results, with prior information about shape captured at training time and represented implicitly in the network’s weights. An alternative way to reason about the 3D geometry of the object is to combine 2D image cues (for example, the object silhouette) with explicitly learned priors about the object shape, thereby separating the 2D recognition from the 3D reconstruction. This two-stage framework was successfully used for inferring 3D shape from images prior to the resurgence of deep learning, by [40, 39] in single-view and by [4, 13] in multi-view reconstruction frameworks.

For example, [40, 39] propose to estimate the 3D object shape and pose given a single silhouette. To make this highly ill-posed problem well-posed, these works learn a probabilistic latent space embedding using a Gaussian Process Latent Variable Model (GPLVM). Object shape and pose inference (reconstruction) is formulated as an optimization problem: find the most likely shape (i.e. one generated from the low-variance area of the learned latent space) which, when projected to the image plane using the optimal pose, induces an object segmentation consistent with the one produced by a recognition method. Using non-linear dimensionality reduction and an inference method which successfully accounts for the variance of the generated shape, [40, 39] achieved state-of-the-art reconstructions for simple objects like cars in the presence of heavy occlusion and background clutter. Furthermore, the iterative nature of these algorithms leads very naturally – almost trivially – to an extension to a tracking framework with temporal refinement as pose or even shape changes dynamically [40].

These methods worked quite well for object categories with small shape variations and which lack high-frequency detail – like cars or human bodies – and they can produce dense and accurate volumetric reconstructions. However, more complex objects like chairs, with larger shape variation and thin structures, were significant failure cases. Some of the limitations arise from the use of GPLVM for dimensionality reduction. GPLVM does not scale well to training on large datasets or to learning priors on high-dimensional shape representations. This forced [40] to use frequency-based compression techniques for representing shapes, leading to the loss of thin and other high-frequency object structures.

Deep learning has emerged as an attractive alternative to address these issues, and we explore the possibilities in this paper. In particular, we design a method that uses deep neural nets to initialize object shape as a latent code and pose, which are at inference time optimized in a manner analogous to traditional iterative approaches; i.e. by optimizing the alignment between the image silhouette and the expected projection of the object, guided by a shape prior term captured via a deep auto-encoder. The auto-encoder’s latent space is given a probabilistic interpretation using a Gaussian Mixture Model (GMM) on the latent codes; we show that this yields a learnable, simple but effective probabilistic prior. The image silhouette is obtained using a conventional segmentation network, and the learned shape prior helps avoid unrealistic shapes (because these correspond to low confidence regions of the GMM). An illustration of the framework can be found in Figure 2.

In summary, our contributions are as follows:

  • Based on current best-practice, we design networks for shape reconstruction and pose estimation that we demonstrate lead to state-of-the-art reconstruction results even in purely feed-forward operation.

  • Our feed-forward image-to-shape regression explicitly includes a latent shape space created using a deep auto-encoder, using the power of deep learning for successful dimensionality reduction. We show that a simple GMM fitting to this latent space can be used as an effective probabilistic shape prior.

  • We combine these properties to yield the first fully deep shape-prior aware inference method for in-pose object reconstruction. We show that this inference-as-optimization – essentially a deep version of [39] – improves the previous state of the art by a large margin.

We would like to distinguish our work from two recently proposed works, MarrNet [51] and Zhu et al. [59]. In these works, the authors also propose to refine the shape prediction from a neural network by fitting the final reconstruction to a predicted silhouette. However, neither of these frameworks uses a pre-learned probabilistic shape prior for refinement. As a result, unrealistic shapes may be generated, damaging the reconstruction quality. Our ablation study empirically shows the importance of the proposed shape prior and a clear advantage of the proposed method over these works.

Figure 2: Our Framework. (a) Two-stage training of the point cloud auto-encoder and image regressor: We train the point cloud auto-encoder in the first stage to learn a latent space for 3D shapes, followed by the training of the image regressor that maps an image to a latent code corresponding to its 3D shape. (b) Learning the probabilistic shape prior on the latent space: A set of 3D point clouds is mapped to latent codes by the trained point cloud encoder. A GMM is learned as a probabilistic model of the latent space by fitting it to this set of latent codes. (c) Our inference-as-optimization: Given a single image, we estimate the object silhouette using the Sil-net. The optimization starts with the initialization of the object pose given by the KP-net and of the shape, as a latent code, given by the image regressor. The shape and pose are then optimized, guided by the probabilistic shape prior via the learned GMM and by the silhouette-shape constraint. The orange, green, blue, and purple blocks correspond to the components for shape, pose, silhouette, and loss terms respectively. The red arrows represent the gradient flow with respect to the latent code and the object pose during inference.

2 Related Works

3D understanding of object shapes from images is considered an important step in scene perception. While mapping an environment using Structure from Motion [46, 3] and SLAM [14, 38] facilitates localization and navigation, a higher level of understanding of objects, in terms of their shape and position relative to the rest of the scene, is essential for their manipulation. Initial works that localize an object while estimating its pose from an image are limited to the case where a pre-scanned object structure is available [41, 18, 42].

It was soon realized that the valid 3D shapes of objects belonging to a specific category are highly correlated, and dimensionality reduction emerged as a prominent tool to model object shapes. Works like [49] reconstruct traditional image segmentation datasets like PASCAL VOC [16] by extending the ideas of non-rigid Structure from Motion [7, 6, 11, 19]. Methods like [31] use these learned category-specific object shape manifolds to reconstruct the shape of an object from a single image by fitting the reconstructed shape to a single image silhouette and refining it with simple image-based methods like shape from shading [28].

While the methods mentioned above attempt to learn linear object shape manifolds from 2D keypoint correspondences or object segmentation annotations, the increasing availability of class-specific 3D CAD models and object scans allows researchers to learn complex manifolds for shapes within an object category directly from data. For example, the Gaussian Process Latent Variable Model or Kernel/simple Principal Component Analysis is employed to learn a compact latent space of object shapes in [12, 40, 13, 15]. Taking advantage of modern deep learning techniques, deep auto-encoders [27, 50], VAEs [34], and GANs [21, 53] can be better alternatives for modeling object shapes than traditional dimensionality reduction tools.

Recently, with the advance of deep learning, many end-to-end systems have been proposed to solve object shape reconstruction from a single RGB image directly. Different representations are adopted for 3D shapes in these deep learning based methods. The volumetric representation is dominant in the early works [10, 20, 24, 48, 51] because it is considered a natural extension of dense 2D segmentation: 3D deconvolutions replace the 2D convolutions of the CNNs used for image segmentation. Octrees [25, 45] and the Discrete Cosine Transform [30] are proposed to reduce the memory requirement for predicting high-resolution volumetric grids. Other representations such as meshes [32, 57], multi-view depth maps [44, 36, 35], and point clouds [17, 29] are also explored. It is shown in [17, 29] that the point cloud representation is more efficient than the volumetric representation, and simpler than meshes and octrees in terms of designing a suitable network architecture and training loss.

Another crucial aspect that affects the performance of a deep learning solution for shape reconstruction is the role of object pose. One stream of research [17, 51] incorporates pose implicitly into the shape reconstruction network, so that the predicted shape is aligned with the input image. In another stream, it is shown in [58, 43] that decoupling shape and pose while reasoning about object structure from a single image has performance benefits. For instance, a canonical shape embedding in a low-dimensional space is employed in [20, 53]. Besides the fully supervised training used in [20, 10, 17], researchers also explore weakly supervised training of the network without an explicit 3D loss. For instance, in Yan et al. [56] and Tulsiani et al. [48], a 3D shape generated by the network is supervised by comparing its projections at multiple viewpoints with the corresponding ground-truth silhouettes. In particular, Zhu et al. [58] mitigate the domain gap between the synthetic images used in training and the real images seen at test time by fine-tuning their network on a small number of real images with ground-truth silhouette annotations. This weakly supervised training loss is similar to the silhouette-shape constraint used during our inference.

Following the best practices in the deep learning regime, we use a point cloud as the shape representation, decouple shape and pose by reconstructing 3D shapes in a canonical pose and estimating object pose separately, and adopt the two-step training schedule for the shape reconstruction network used in the TL-embedding network [20], where a latent space for 3D shapes is first learned explicitly via an auto-encoder and an image regressor that maps an RGB image to the learned latent space is then trained. On top of these deep learning components, we use the predicted silhouette and the learned shape prior to turn the one-shot prediction of the deep networks into an optimization at inference time.

Similar to our framework, CodeSLAM [5], MarrNet [51], and Zhu et al. [59] also use a low-dimensional latent space to represent geometry information. For scene-level geometry reconstruction, CodeSLAM uses nearby views to form a photometric loss, which is minimized by searching in the learned latent space of depth maps. MarrNet [51] and Zhu et al. [59] enforce a 2.5D (silhouette, depth or normal) to 3D shape constraint and a photometric loss, respectively, to search in the latent space of object shapes. Nevertheless, a common disadvantage of these works, which we address in ours, is that they do not explicitly consider a probabilistic prior on the latent space when they perform optimization. Our ablation study shows that such a prior is essential for optimization at inference time.

3 Method

The goal of this work is to reconstruct the 3D shape of an object and its pose with respect to the camera from a single image. Our framework closely follows traditional object-prior-based 3D reconstruction methods by framing the reconstruction as an optimization problem with respect to the object shape and pose, assuming that a deep neural network provides us with the object silhouette. This inference method is described in Section 3.1. To facilitate the inference, we require a low-dimensional latent space which encodes the object shape, a learned probabilistic prior on this latent space providing the likelihood of any point in the latent space, a single-view object silhouette prediction to guide the shape and pose optimization, and an initialization for object shape and pose. How we learn these prerequisites for inference is elaborated in Section 3.2.

3.1 Inference-as-optimization

For inference, we are given a set of 2D points $S$ sampled from the predicted silhouette, an initial latent code $z_0$ of the object shape, and an initial pose (in terms of rotation $R_0$ and translation $t_0$), each predicted using neural nets. Additionally, the learned probabilistic prior on the latent space is available in the form of a GMM.

We minimize the objective function defined below with respect to the shape and pose variables $z$, $R$ and $t$:

$$\mathcal{L}(z, R, t) = \mathcal{L}_{\text{sil}}(z, R, t) + \lambda \, \mathcal{L}_{\text{prior}}(z) \tag{1}$$

Here $\mathcal{L}_{\text{sil}}$ corresponds to the silhouette-shape constraint and $\mathcal{L}_{\text{prior}}$ is the term related to the probabilistic shape prior, which we explain in the rest of this section. $\lambda$ is the weighting factor between these two loss terms.

Silhouette-shape Constraint Term. It enforces the projection of a transformed 3D shape to be consistent with the object’s silhouette in the image. With a point cloud representation, this can be achieved by defining a 2D Chamfer loss between the projection of the transformed 3D point cloud and the sampled points of the predicted silhouette, as shown in Equation (2),

$$\mathcal{L}_{\text{sil}}(z, R, t) = \sum_{p \in \pi(D(z))} \min_{s \in S} \lVert p - s \rVert_2^2 + \sum_{s \in S} \min_{p \in \pi(D(z))} \lVert p - s \rVert_2^2 \tag{2}$$

where $D(z)$ is a point cloud reconstruction in a predefined canonical pose generated from a latent code $z$ by the point cloud decoder $D$. $\pi$ is a transformation that is composed of the camera intrinsic matrix $K$, the object rotation $R$ and its translation $t$, and maps each 3D point to the image plane. $S$ is a set of 2D points sampled from a predicted silhouette.
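The silhouette-shape constraint can be illustrated with a short NumPy sketch. It assumes a pinhole camera, points in front of the camera, and a Chamfer loss written as a sum of squared nearest-neighbour distances; the function names `project_points`, `chamfer_distance` and `silhouette_loss` are illustrative and not taken from the authors' code.

```python
import numpy as np


def project_points(points_3d, K, R, t):
    """Rigidly transform N x 3 points by (R, t), then project with intrinsics K."""
    cam = points_3d @ R.T + t          # points in the camera frame, shape (N, 3)
    uvw = cam @ K.T                    # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide, shape (N, 2)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: sum of squared nearest-neighbour distances."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # shape (|a|, |b|)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()


def silhouette_loss(points_3d, silhouette_pts, K, R, t):
    """2D Chamfer loss between the projected cloud and silhouette samples."""
    return chamfer_distance(project_points(points_3d, K, R, t), silhouette_pts)
```

Because `chamfer_distance` only relies on pairwise distances, the same function can be applied to 3D point sets. A differentiable version would be written in an autodiff framework so that gradients reach the latent code and the pose.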

Shape Prior Term. Because the silhouette-shape constraint is highly unconstrained and the predicted silhouette may be incorrect, we employ a shape prior term, which quantifies the likelihood of a latent code using a probabilistic model. This mitigates the problem of generating unrealistic 3D shapes. Specifically, the shape prior minimizes the negative log-likelihood of a latent code $z$ under the GMM, which can be written as:

$$\mathcal{L}_{\text{prior}}(z) = -\log \sum_{k=1}^{M} w_k \, \mathcal{N}(z \mid \mu_k, \Sigma_k) \tag{3}$$

where $w_k$, $\mu_k$, and $\Sigma_k$ are the weight, mean, and covariance matrix of the $k$-th Gaussian component, and $M$ is the number of Gaussian components.
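As a minimal sketch of the shape prior term, the negative log-likelihood of a latent code under a GMM can be evaluated with the log-sum-exp trick for numerical stability. The function name `gmm_neg_log_likelihood` and the full-covariance parameterization are assumptions made for illustration.

```python
import numpy as np


def gmm_neg_log_likelihood(z, weights, means, covs):
    """Return -log sum_k w_k N(z | mu_k, Sigma_k), computed via log-sum-exp."""
    d = z.shape[0]
    log_terms = []
    for w, mu, cov in zip(weights, means, covs):
        diff = z - mu
        _, logdet = np.linalg.slogdet(cov)           # log |Sigma_k|
        maha = diff @ np.linalg.solve(cov, diff)     # squared Mahalanobis distance
        log_terms.append(np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    log_terms = np.array(log_terms)
    m = log_terms.max()                              # shift for numerical stability
    return -(m + np.log(np.exp(log_terms - m).sum()))
```

Latent codes near a component mean receive a low value, while codes between components receive a high one, which is what penalizes implausible shapes during refinement.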

3.2 Learning

As explained in Section 3.1, our inference method requires learning a probabilistic low-dimensional representation of object shape from data and three networks predicting object silhouette, initial object pose and object shape in the form of a latent code. In this section we describe how we learn these components.

3.2.1 Learning Shape Latent Space and Prior

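Our inference method relies heavily on learning a low-dimensional latent space of shapes from which a decoder network $D$ can map a latent code $z$ to a unique 3D point cloud $\hat{X} = D(z)$. To learn this mapping, we train an encoder-decoder network. Similar to a conventional auto-encoder, our encoder $E$ maps a point cloud $X$ to its corresponding low-dimensional latent code $z = E(X)$, whereas the decoder reconstructs the point cloud from the latent code $z$. We train this auto-encoder using the 3D Chamfer loss defined below:

$$\mathcal{L}_{\text{CD}}(X, \hat{X}) = \sum_{x \in X} \min_{\hat{x} \in \hat{X}} \lVert x - \hat{x} \rVert_2^2 + \sum_{\hat{x} \in \hat{X}} \min_{x \in X} \lVert x - \hat{x} \rVert_2^2 \tag{4}$$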

To learn a probabilistic model as the shape prior, we fit a GMM to the learned latent space. Specifically, we map the point clouds from the training set to the latent space using the point cloud encoder to get a set of latent codes $\{z_i\}_{i=1}^{N}$. The parameters of a GMM with $M$ components are then learned to maximize the likelihood of the training set’s latent codes, using an EM algorithm to optimize the following objective:

$$\max_{\{w_k, \mu_k, \Sigma_k\}} \sum_{i=1}^{N} \log \sum_{k=1}^{M} w_k \, \mathcal{N}(z_i \mid \mu_k, \Sigma_k) \tag{5}$$

where $w_k$, $\mu_k$, and $\Sigma_k$ are the weight, mean and covariance matrix of the $k$-th Gaussian component, and $N$ and $M$ are the number of sample latent codes and the number of Gaussian components.
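The EM fitting step can be sketched in plain NumPy as below. This is an illustrative implementation under stated assumptions (full covariances, a deterministic farthest-point initialization, a small diagonal regularizer `reg`, and latent dimension of at least two), not the authors' training code; in practice an off-the-shelf GMM implementation would serve equally well.

```python
import numpy as np


def log_gauss(Z, mu, cov):
    """Log density of N(mu, cov) evaluated at each row of Z (N x d)."""
    d = Z.shape[1]
    diff = Z - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def fit_gmm(Z, n_components, n_iters=100, reg=1e-6):
    """Fit a full-covariance GMM to latent codes Z (N x d) with plain EM."""
    n, d = Z.shape
    # Farthest-point initialisation of the means (an illustrative choice).
    means = [Z[0]]
    for _ in range(n_components - 1):
        dist = np.min([np.sum((Z - m) ** 2, axis=1) for m in means], axis=0)
        means.append(Z[np.argmax(dist)])
    means = np.array(means)
    base_cov = np.atleast_2d(np.cov(Z.T)) + reg * np.eye(d)
    covs = np.array([base_cov] * n_components)
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iters):
        # E-step: posterior responsibility of each component for each code.
        log_r = np.stack(
            [np.log(weights[k]) + log_gauss(Z, means[k], covs[k])
             for k in range(n_components)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ Z) / nk[:, None]
        for k in range(n_components):
            diff = Z - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
    return weights, means, covs
```

Here `Z` stands for the matrix of latent codes obtained by encoding the training point clouds.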

There are a number of generative models that can be used as a shape prior. We use a GMM on top of a simple auto-encoder, following the recent finding of [2], which shows empirically that, for point clouds, a GMM outperforms GAN variants in generalization and diversity of shapes, meaning that the GMM better captures the distribution of object shapes.

3.2.2 Learning segmentation and initialization

Our inference requires three networks, each taking an RGB image containing the object of interest as input. These networks are: (i) 3D-net: predicts the latent code $z_0$ corresponding to an image, which can be used to initialize the object shape using the decoder $D$. (ii) KP (KeyPoint)-net: predicts the projected locations of the 8 corner points of the object’s 3D bounding box, which are used to recover the initial object pose $R_0$ and $t_0$ with respect to the camera. (iii) Sil-net: predicts the object’s silhouette in the image, which guides the inference. In this section we outline the procedure adopted in this work to train these components.

3D-net. This network consists of two parts. An image regressor learns to map an image to the latent code, which is decoded by the learned decoder introduced in Section 3.2.1 to generate a 3D point cloud. During training, when the 3D point cloud corresponding to the image is available, the image regressor is trained by minimizing the error between its prediction and the output of the trained encoder given the ground-truth 3D point cloud. During inference, the image regressor maps the input image to a latent code which is used as the initialization of the object shape in the latent space.

KP-net. We define the eight corner points of the object’s 3D bounding box in the predefined canonical pose as 3D keypoints. The KP-net learns to predict the projected locations of these eight points on the image plane, using the ground-truth object pose available at training time. A loss between the ground-truth 2D projections of the 3D keypoints and the predictions is employed to supervise the training.

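The 2D-3D keypoint correspondence is used to recover the object pose with respect to the camera, which serves as the initialization for inference (Section 3.1). More specifically, at inference time, we retrieve the 3D corner points from the point cloud reconstruction of the 3D-net and pair them with the KP-net predictions to form the 2D-3D correspondences. Object pose estimation is then formulated as the PnP problem in Equation (6), which is solved iteratively using the Levenberg-Marquardt algorithm [37]:

$$R_0, t_0 = \arg\min_{R, t} \sum_{i=1}^{8} \lVert u_i - \pi(x_i; K, R, t) \rVert_2^2 \tag{6}$$

where $x_i$ are the 3D corner points, $u_i$ are the corresponding 2D keypoints predicted by the KP-net, and $\pi(\cdot\,; K, R, t)$ projects a 3D point to the image plane using the camera intrinsic matrix $K$, rotation $R$ and translation $t$.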
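The quantities entering the PnP problem can be sketched as follows. The sketch assumes an axis-aligned bounding box in the canonical frame and a pinhole camera; `bbox_corners`, `project` and `reprojection_error` are hypothetical helper names. The actual minimization over rotation and translation would be delegated to a Levenberg-Marquardt solver, which is not reproduced here.

```python
import numpy as np


def bbox_corners(points):
    """Eight corners of the axis-aligned 3D bounding box of an N x 3 cloud."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    return np.array([[x, y, z]
                     for x in (lo[0], hi[0])
                     for y in (lo[1], hi[1])
                     for z in (lo[2], hi[2])])


def project(points_3d, K, R, t):
    """Pinhole projection of N x 3 points under pose (R, t) and intrinsics K."""
    uvw = (points_3d @ R.T + t) @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def reprojection_error(corners_3d, keypoints_2d, K, R, t):
    """Sum of squared distances between projected corners and 2D keypoints."""
    return np.sum((project(corners_3d, K, R, t) - keypoints_2d) ** 2)
```

The residual is zero when the pose reproduces the predicted keypoints exactly and grows as the pose drifts away from them, which is the cost an iterative PnP solver drives down.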


Sil-net. A straightforward U-net architecture is trained to map an RGB image to the corresponding object silhouette, from which we sample points to generate the point set $S$ used in the silhouette-shape constraint for inference. The network is trained in a fully supervised manner using binary cross-entropy as the loss function. Instead of training the Sil-net from scratch, our framework can also use an off-the-shelf semantic segmentation network [26, 9] or fine-tune from such a network.

4 Experiments

We provide experimental results of our framework on single view object shape reconstruction and pose estimation on a real dataset and compare it against prior art. A series of ablation studies are also presented to analyze the effect of our inference procedure and novel probabilistic shape prior.

Figure 3: Qualitative Results on Pix3D: We compare shape prior based inference against the baseline (without optimization at inference time) and optimization without the probabilistic shape prior. From left to right: input image, predicted silhouette, ground-truth point cloud, baseline output, optimize without shape prior, full model.

4.1 Implementation Details

Data to train all networks is generated from the ShapeNet dataset [8]. We render synthetic images from the textured 3D CAD models from random viewpoints and overlay random background on the rendered objects to simulate the background clutter present in the real images. Background images are acquired from SUN dataset [55]. Random flips, crops and image color augmentations are used for training. To get the ground-truth 2D keypoint locations for training the KP-net, we use the 3D coordinates of corner points of the ground-truth shape’s bounding box. We transform and project the bounding box coordinates onto the image plane using the corresponding object pose which is used to render the image. The point cloud auto-encoder is trained for 400 epochs, the image regressor is trained for 25 epochs, the Sil-net and KP-net are trained end-to-end for 60 epochs and 25 epochs respectively. Adam optimizer [33] is used to train all these networks. The details about architecture for these neural networks are available in the supplementary material.

For inference, we parameterize the rotation with Euler angles because we empirically find this representation easier to optimize than quaternions or Lie-group parameterizations. The 6-DoF object pose and the shape latent code are optimized jointly using the Adam optimizer in TensorFlow [1]. It is important to note that we only update the latent code corresponding to the inferred object shape, and not the decoder network’s weights, during gradient descent. The optimization is terminated when the rate of change of the loss function diminishes or the maximum of 2000 iterations is reached.
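The pose parameterization and the stopping rule can be sketched as below. The axis order (rotation about x, then y, then z) and the relative tolerance `rel_tol` are assumptions chosen for illustration, since the text does not state them; only the cap of 2000 iterations comes from the description above.

```python
import numpy as np


def euler_to_rotation(rx, ry, rz):
    """Rotation matrix R = Rz @ Ry @ Rx from Euler angles given in radians."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx


def should_stop(losses, max_iters=2000, rel_tol=1e-6):
    """Stop at the iteration cap or when the relative loss change is tiny."""
    if len(losses) >= max_iters:
        return True
    if len(losses) < 2:
        return False
    prev, cur = losses[-2], losses[-1]
    return abs(prev - cur) <= rel_tol * max(abs(prev), 1e-12)
```

In a refinement loop, the three angles, the translation and the latent code would be the only trainable variables, with the decoder weights held fixed.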

Figure 4: Interpolation on the latent space: We denote the likelihood of a shape explained by the GMM using colors. Red means high likelihood whereas blue represents low likelihood.
Methods CD EMD
Ours (baseline)
Table 1: 3D shape reconstruction results on Pix3D dataset

4.2 Testing Data and Evaluation Protocol

We take advantage of the recently published Pix3D dataset [43], which consists of diverse shapes aligned with corresponding real images. We choose the chair class to demonstrate our method for two reasons. Firstly, the chair class is a major component of this dataset and exhibits the most shape diversity among all classes. There are 2894 images of chairs corresponding to 221 unique chair shapes, which greatly outnumbers the other object classes. For instance, table, the second largest class, has only 738 images corresponding to 63 shapes. Secondly, the authors of Pix3D publish benchmark results for most of the state-of-the-art methods on the chair class, which facilitates convenient comparison of our work with the prior art.

Popular alternatives to Pix3D are either purely synthetic datasets like ShapeNet [8], or PASCAL 3D+ [54], which was created by associating a small number of overlapping 3D CAD models per category with both the training and test images of that category to create pseudo ground truth. For this reason, as discussed in [48, 51], PASCAL 3D+ is no longer deemed fit for evaluating shape reconstruction.

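We use the 3D Chamfer Distance (CD, as defined in Equation (4)) and the Earth Mover’s Distance (EMD, as defined in Equation (7)) as evaluation measures for shape reconstruction, following the human study on evaluation metrics for shape reconstruction presented in Pix3D, which shows that these metrics are more correlated with human perception than IoU.

$$d_{\text{EMD}}(X, \hat{X}) = \min_{\phi: X \rightarrow \hat{X}} \sum_{x \in X} \lVert x - \phi(x) \rVert_2 \tag{7}$$

where $X$ and $\hat{X}$ are the ground-truth and reconstructed point clouds and $\phi$ is a bijection. We use the approximation implemented in the Pix3D benchmark setup to calculate the EMD efficiently.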
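To make the role of the bijection concrete, the following sketch computes the EMD exactly by enumerating every one-to-one matching between two equal-size point sets. This brute-force version is only feasible for a handful of points and is not the approximation used in the benchmark; `emd_exact` is an illustrative name.

```python
import itertools

import numpy as np


def emd_exact(a, b):
    """Exact EMD between equal-size point sets by enumerating all bijections."""
    assert len(a) == len(b), "a bijection requires equally sized point sets"
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    idx = range(len(a))
    # Each permutation p assigns point i of `a` to point p[i] of `b`.
    return min(sum(dist[i, p[i]] for i in idx)
               for p in itertools.permutations(idx))
```

For realistic point clouds the factorial cost is prohibitive, which is why an assignment solver or an approximation scheme is used in practice.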

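Following prior art, we evaluate pose using MedErr, the median of the geodesic distance between the predicted rotation and the ground-truth rotation defined in Equation (8), and $\text{Acc}_{\pi/6}$, the percentage of test instances whose geodesic distance is smaller than $\pi/6$.

$$\Delta(R_{\text{pred}}, R_{\text{gt}}) = \frac{\lVert \log(R_{\text{pred}}^{\top} R_{\text{gt}}) \rVert_F}{\sqrt{2}} \tag{8}$$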
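The pose metrics can be sketched as follows. The geodesic distance between two rotations equals the rotation angle of their relative rotation, which is recovered here from the trace; the default threshold of π/6 is an assumption based on common practice in viewpoint estimation, and the function names are illustrative.

```python
import numpy as np


def geodesic_distance(R1, R2):
    """Rotation angle (radians) of R1^T R2, i.e. ||log(R1^T R2)||_F / sqrt(2)."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))   # clip guards against round-off


def pose_metrics(pred_rotations, gt_rotations, threshold=np.pi / 6):
    """Return (median error in radians, fraction of errors below threshold)."""
    errs = np.array([geodesic_distance(p, g)
                     for p, g in zip(pred_rotations, gt_rotations)])
    return np.median(errs), np.mean(errs < threshold)
```

The first returned value corresponds to MedErr and the second to the accuracy at the chosen threshold, expressed as a fraction rather than a percentage.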

Methods Shape Pose
w/ GT
Table 2: Quantitative analysis on the effect of silhouette-shape constraint and probabilistic shape prior at inference.

4.3 Shape Reconstruction on Pix3D

We compare the performance of our 3D shape prediction pipeline with a list of prior art including 3D-VAE-GAN [53], 3D-R2N2 [10], PSGN [17], DRC [48], AtlasNet [23], MarrNet [51] and Pix3D [52], which is a variant of MarrNet that decouples shape and pose prediction. All the methods use ShapeNet for training their respective structure prediction frameworks. The quantitative results are reported in Table 1. The performance evaluation for most of the methods listed above is taken from the Pix3D paper [43].

Ours (baseline) in Table 1 refers to the one-shot deep network prediction without any of the inference-time optimization outlined in Section 3.1. It is notable that our baseline already outperforms the prior art. We believe that this performance boost is due to a unique combination of best practices carefully picked from prior art to design our one-shot prediction baseline. For instance, our method predicts a canonical point cloud to represent shape. Other methods either opt for an in-pose point cloud representation [17], i.e., coupling shape and pose, or work with volumetric or mesh representations of object shapes. In addition, similar to the TL-embedding network [20] and 3D-VAE-GAN [53], we learn a latent space of 3D shapes explicitly instead of a direct 2D to 3D mapping.

More importantly, the results in Table 1 show that our probabilistic shape prior aware optimization method for shape and pose outperforms the very strong feed-forward baseline (along with other state-of-the-art methods) by a significant margin. Qualitative examples of our in-pose reconstruction can be found in the supplementary material.

4.4 Ablation Study on Shape prior

In this section, we analyze the effect of the different components used in our shape and pose inference quantitatively (Table 2) and qualitatively (Figure 3). In particular, we demonstrate that the proposed probabilistic shape prior plays a crucial part in shape and pose estimation. To that end, we compare the one-shot shape prediction used as the baseline with a simple optimization of pose and shape using the silhouette only, i.e., minimizing the silhouette-shape constraint term $\mathcal{L}_{\text{sil}}$ alone, and with our full inference method, which also takes into account the likelihood of the generated shape by using both $\mathcal{L}_{\text{sil}}$ and the shape prior term $\mathcal{L}_{\text{prior}}$.

Figure 5: Performance boost by using the shape-prior. We visualize the performance gap on shape reconstruction between optimization with and without shape prior at different IoU levels of the predicted silhouette. It is clear that the shape prior increases the robustness of the optimization to noisy silhouettes.

Figure 3 visualizes the input images, predicted silhouettes, and shape reconstructions given by different inference schemes. Optimization using both the shape prior and the silhouette-shape constraint leads to reconstructions closer to the ground-truth shape than the one-shot prediction. Notably, if the shape prior is discarded, optimizing with the silhouette-shape constraint alone is likely to generate unnatural shapes, such as uneven chair seats, chair arms with different heights, and unrealistically short chair legs, as highlighted by red boxes in the figure. These unrealistic reconstructions are in part overfitted to noisy object silhouettes.

As for the quantitative comparison in Table 2, although minimizing the silhouette-shape constraint alone leads to an improvement in shape estimation in terms of CD, the EMD measure suggests that the estimated shape is inferior to the baseline. This perhaps reflects the fact that the estimated reconstruction overfits to the CD metric. However, when we minimize the objective function which takes into account the likelihood of the learned latent embedding of shapes, the reconstruction performance improves significantly on both measures, along with a reasonable boost in pose estimation. To put this performance improvement in perspective, we use the ground-truth silhouette annotations available in Pix3D to estimate pose and shape with the silhouette-shape constraint only, and evaluate the reconstruction performance to establish an upper bound for 3D shape inference from a single silhouette.

Further, in Figure 5 we analyze how this performance gap in reconstruction changes with the accuracy of the predicted silhouette. We plot the median CD and EMD, in the left and right parts of the figure respectively, against different accuracy levels of the silhouette predictions, quantified in terms of mean IoU. It is clear that optimization with the shape prior performs significantly better in cases where the estimated silhouette has gross errors.

Our contention is that the learned shape prior steers the optimization away from low-likelihood regions of the latent space, restricting the inferred shape to realistic object-like solutions and thereby avoiding overfitting to a noisy silhouette. To further explore this hypothesis, we choose three points from the latent space corresponding to the means of different Gaussian components and interpolate linearly in the latent space to generate intermediate sample chair shapes. The results are visualized in Figure 4. It can be clearly seen that the latent space learned using deep auto-encoder training contains regions corresponding to shapes that are not chair-like at all. However, these shapes usually correspond to low-likelihood regions of the learned GMM – for instance, between the latent code of a four-legged chair (mean_0) and that of a swivel chair (mean_1), the likelihood of an unrealistic shape that mixes the two structures is particularly low.

4.5 Pose Estimation using Keypoints

In terms of pose estimation, we compare our keypoint-based approach with direct classification and regression approaches, which predict the rotation parameters directly. For a fair comparison, we use the same training data and similar network architectures for all approaches; only the last output layer differs, because of the different output dimensions. Our keypoint-based method outperforms both the regression and classification approaches, as shown in Table 3. We are the first to show that the keypoint-based method can also work with shape reconstruction without explicitly predicting 3D keypoints and can do better than classification or regression. This result is in line with other keypoint-based methods [22, 52] that outperform direct prediction methods.

Methods MedErr
Table 3: Quantitative results on object pose estimation.
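For readers unfamiliar with median rotation error, the sketch below shows one common way such a measure is computed: the geodesic angle between predicted and ground-truth rotation matrices, followed by the median over all test images. Treating MedErr as this geodesic angle in degrees is an assumption, and the helper names and toy rotations are hypothetical.

```python
import numpy as np


def rotation_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z-axis."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def geodesic_error_deg(r_pred, r_gt):
    """Angle in degrees of the relative rotation between two rotation matrices."""
    cos_angle = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def median_error_deg(preds, gts):
    """Median geodesic error over paired predicted and ground-truth rotations."""
    return float(np.median([geodesic_error_deg(p, g) for p, g in zip(preds, gts)]))


# Toy predictions that are off by 5, 20 and 90 degrees respectively.
preds = [rotation_z(5), rotation_z(30), rotation_z(100)]
gts = [rotation_z(0), rotation_z(10), rotation_z(10)]
med_err = median_error_deg(preds, gts)
```

The median is insensitive to the single 90 degree outlier in this toy set, which is one reason it is a popular summary statistic for pose estimation.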

5 Discussion

Inspired by traditional shape-prior-based methods for reconstruction from a single view, where the silhouette prediction and the 3D shape reconstruction are separated, we have presented a deep learning based framework that replaces or enhances the one-shot prediction of deep learning with iterative optimization. Our inference method not only reasons about the silhouette-shape constraint, but also uses a probabilistic shape prior learned on the latent shape space to avoid unrealistic reconstructions, achieving state-of-the-art performance on a large dataset of real images.

The inference-as-optimization framework with an explicit probabilistic shape prior can be easily extended to exploit information from multiple views. This can be achieved by simply changing the inference loss to incorporate multiple silhouette consistency terms and, additionally, multi-view photometric consistency. Furthermore, frameworks like CodeSLAM [5], which uses a latent code representation to facilitate dense bundle adjustment, would potentially benefit from the use of a probabilistic prior such as the one we have proposed.

References


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [2] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. 2018.
  • [3] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building rome in a day. In Computer Vision, 2009 IEEE 12th International Conference on, pages 72–79. IEEE, 2009.
  • [4] S. Y. Bao, M. Chandraker, Y. Lin, and S. Savarese. Dense object reconstruction with semantic priors. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1264–1271, June 2013.
  • [5] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison. Codeslam-learning a compact, optimisable representation for dense visual slam. arXiv preprint arXiv:1804.00874, 2018.
  • [6] W. Brand. Morphable 3d models from video. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2001.
  • [7] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 690–696. IEEE, 2000.
  • [8] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • [10] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pages 628–644. Springer, 2016.
  • [11] Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. International Journal of Computer Vision, 107(2):101–122, 2014.
  • [12] S. Dambreville, Y. Rathi, and A. Tannenbaum. A framework for image segmentation using shape models and kernel space shape priors. IEEE transactions on pattern analysis and machine intelligence, 30(8):1385–1399, 2008.
  • [13] A. Dame, V. A. Prisacariu, C. Y. Ren, and I. Reid. Dense reconstruction using 3d object shape priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1288–1295, 2013.
  • [14] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. Monoslam: Real-time single camera slam. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):1052–1067, 2007.
  • [15] F. Engelmann, J. Stückler, and B. Leibe. Joint object pose estimation and shape reconstruction in urban street scenes using 3d shape priors. In German Conference on Pattern Recognition, pages 219–230. Springer, 2016.
  • [16] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [17] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.
  • [18] J. Gall, B. Rosenhahn, and H.-P. Seidel. Clustered stochastic optimization for object recognition and pose estimation. In Joint Pattern Recognition Symposium, pages 32–41. Springer, 2007.
  • [19] R. Garg, A. Roussos, and L. Agapito. Dense variational reconstruction of non-rigid surfaces from monocular video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1272–1279, 2013.
  • [20] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.
  • [21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [22] A. Grabner, P. M. Roth, and V. Lepetit. 3d pose estimation and 3d model retrieval for objects in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3022–3031, 2018.
  • [23] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. Atlasnet: A papier-mâché approach to learning 3d surface generation. arXiv preprint arXiv:1802.05384, 2018.
  • [24] J. Gwak, C. B. Choy, M. Chandraker, A. Garg, and S. Savarese. Weakly supervised 3d reconstruction with adversarial constraint. In 3D Vision (3DV), 2017 International Conference on, pages 263–272. IEEE, 2017.
  • [25] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. In 3D Vision (3DV), 2017 International Conference on, pages 412–420. IEEE, 2017.
  • [26] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
  • [27] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [28] B. K. Horn. Obtaining shape from shading information. MIT press, 1989.
  • [29] E. Insafutdinov and A. Dosovitskiy. Unsupervised learning of shape and pose with differentiable point clouds. arXiv preprint arXiv:1810.09381, 2018.
  • [30] A. Johnston, R. Garg, G. Carneiro, I. D. Reid, and A. van den Hengel. Scaling cnns for high resolution volumetric reconstruction from a single image. In ICCV Workshops, pages 930–939, 2017.
  • [31] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1966–1974, 2015.
  • [32] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.
  • [33] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [34] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [35] K. Li, T. Pham, H. Zhan, and I. Reid. Efficient dense point cloud object reconstruction using deformation vector fields. In Proceedings of the European Conference on Computer Vision (ECCV), pages 497–513, 2018.
  • [36] C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. arXiv preprint arXiv:1706.07036, 2017.
  • [37] J. J. Moré. The levenberg-marquardt algorithm: implementation and theory. In Numerical analysis, pages 105–116. Springer, 1978.
  • [38] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. Dtam: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2320–2327. IEEE, 2011.
  • [39] V. A. Prisacariu and I. Reid. Shared shape spaces. In Proceedings of the 2011 International Conference on Computer Vision, pages 2587–2594. IEEE Computer Society, 2011.
  • [40] V. A. Prisacariu, A. V. Segal, and I. Reid. Simultaneous monocular 2d segmentation, 3d pose recovery and 3d reconstruction. In Asian conference on computer vision, pages 593–606. Springer, 2012.
  • [41] B. Rosenhahn, T. Brox, and J. Weickert. Three-dimensional shape knowledge for joint image segmentation and pose tracking. International Journal of Computer Vision, 73(3):243–262, 2007.
  • [42] C. Schmaltz, B. Rosenhahn, T. Brox, J. Weickert, D. Cremers, L. Wietzke, and G. Sommer. Occlusion modeling by tracking multiple objects. In Joint Pattern Recognition Symposium, pages 173–183. Springer, 2007.
  • [43] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2974–2983, 2018.
  • [44] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016.
  • [45] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), volume 2, page 8, 2017.
  • [46] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.
  • [47] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [48] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, volume 1, page 3, 2017.
  • [49] S. Vicente, J. Carreira, L. Agapito, and J. Batista. Reconstructing pascal voc. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 41–48, 2014.
  • [50] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
  • [51] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. Marrnet: 3d shape reconstruction via 2.5 d sketches. In Advances in neural information processing systems, pages 540–550, 2017.
  • [52] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. 3d interpreter networks for viewer-centered wireframe modeling. International Journal of Computer Vision, pages 1–18, 2018.
  • [53] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
  • [54] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 75–82. IEEE, 2014.
  • [55] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva. Sun database: Exploring a large collection of scene categories. International Journal of Computer Vision, 119(1):3–22, 2016.
  • [56] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016.
  • [57] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Interpretable unsupervised learning on 3d point clouds. arXiv preprint arXiv:1712.07262, 2017.
  • [58] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 57–65. IEEE, 2017.
  • [59] R. Zhu, C. Wang, C.-H. Lin, Z. Wang, and S. Lucey. Object-centric photometric bundle adjustment with deep shape prior. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 894–902. IEEE, 2018.