Optimizable Object Reconstruction from a Single View
Abstract
3D shape reconstruction from a single image is a highly ill-posed problem. A number of current deep-learning-based systems aim to solve the shape reconstruction and pose estimation problems by learning an end-to-end network to perform feed-forward inference. More traditional (non-deep-learning) methods cast the problem in an iterative optimization framework. In this paper, inspired by these more traditional shape-prior-based approaches, which separate 2D recognition from 3D reconstruction, we develop a system that leverages the power of both feed-forward and iterative approaches. Our framework uses the power of deep learning to capture 3D shape information from training data and provide high-quality initialization, while allowing both image evidence and shape priors to influence iterative refinement at inference time. Specifically, we employ an autoencoder to learn a latent space of object shapes, a CNN that maps an image to the latent space, another CNN to predict 2D keypoints to recover object pose using PnP, and a segmentation network to predict an object's silhouette from an RGB image. At inference time these components provide high-quality initial estimates of the shape and pose, which are then further optimized based on the silhouette-shape constraint and a probabilistic shape prior learned on the latent space (see Figure 1). Our experiments show that this optimizable inference framework achieves state-of-the-art results on a large benchmarking dataset with real images.
1 Introduction
Humans effortlessly infer the 3D structure of objects from a single image, including the parts of the object which are not directly visible in the image. Human perception of object geometry clearly draws, in large part, on the vast experience of having seen objects of a category from multiple views, having reasoned about their geometry, and having built rich prior knowledge about meaningful object structure for a specific object category. However, single-view object shape reconstruction remains a challenging problem for modern vision algorithms – perhaps because of the difficulty of capturing and representing this prior shape information – and solutions to 3D geometric perception have traditionally been dominated by multi-view reconstruction pipelines.
The recent successes of deep learning have motivated researchers to depart from traditional multi-view reconstruction methods and directly learn the image-to-object-shape mapping using convolutional neural networks [20, 10, 17, 48, 47, 53, 51]. These feed-forward networks show impressive results, with prior information about shape captured at training time and represented implicitly in the network's weights. An alternative way to reason about the 3D geometry of an object is to combine 2D image cues (for example the object silhouette) with explicitly learned priors about the object shape – thereby separating 2D recognition from 3D reconstruction. This two-stage framework was used successfully for inferring 3D shape from images prior to the resurgence of deep learning by [40, 39] in single-view and [4, 13] in multi-view reconstruction frameworks.
For example, [40, 39] propose to estimate the 3D object shape and pose given a single silhouette. To make this highly ill-posed problem well-posed, these works learned a probabilistic latent space embedding using a Gaussian Process Latent Variable Model (GPLVM). Object shape and pose inference (reconstruction) was formulated as an optimization problem: find the most likely shape (i.e. one generated from a low-variance area of the learned latent space) which, projected under the optimal pose onto the image plane, induces an object segmentation consistent with the one produced by a recognition method. Using nonlinear dimensionality reduction and an inference method which successfully accounts for the variance of the generated shape, [40, 39] achieved state-of-the-art reconstructions for simple objects like cars in the presence of heavy occlusion and background clutter. Furthermore, the iterative nature of these algorithms leads very naturally – almost trivially – to an extension to a tracking framework of temporal refinement as pose or even shape changes dynamically [40].
These methods worked quite well for object categories with small shape variations which lack high-frequency detail – like cars or human bodies – and they can produce dense and accurate volumetric reconstructions. However, more complex objects like chairs, with larger shape variation and thin structures, were significant failure cases. Some of the limitations arise from the use of GPLVM for dimensionality reduction. GPLVM does not scale well to training on large datasets or to learning priors on high-dimensional shape representations. This forces [40] to use frequency-based compression techniques for representing shapes, leading to loss of thin and other high-frequency object structures.
Deep learning has emerged as an attractive alternative to address these issues, and we explore the possibilities in this paper. In particular, we design a method that uses deep neural nets to initialize the object shape (as a latent code) and pose, which are then optimized at inference time in a manner analogous to traditional iterative approaches; i.e. by optimizing the alignment between the image silhouette and the expected projection of the object, guided by a shape prior term captured via a deep autoencoder. The autoencoder's latent space is given a probabilistic interpretation using a Gaussian Mixture Model (GMM) on the latent codes; we show that this yields a learnable, simple but effective probabilistic prior. The image silhouette is obtained using a conventional segmentation network, and the learned shape prior helps avoid unrealistic shapes (because these correspond to low-confidence regions of the GMM). An illustration of the framework can be found in Figure 2.
In summary, our contributions are as follows:

Based on current best practice, we design networks for shape reconstruction and pose estimation that we demonstrate lead to state-of-the-art reconstruction results even in purely feed-forward operation.

Our feed-forward image-to-shape regression explicitly includes a latent shape space created using a deep autoencoder, using the power of deep learning for successful dimensionality reduction. We show that a simple GMM fitted to this latent space can be used as an effective probabilistic shape prior.

We combine these properties to yield the first fully deep shape-prior-aware inference method for in-pose object reconstruction. We show that this inference-as-optimization – essentially a deep version of [39] – improves the previous state of the art by a large margin.
We would like to distinguish our work from the two recently proposed works of MarrNet [51] and Zhu et al. [59]. In these works, the authors also propose to refine the shape prediction of a neural network by fitting the final reconstruction to a predicted silhouette. However, neither of the above-mentioned frameworks uses a pre-learned probabilistic shape prior for this refinement. As a result, unrealistic shapes may be generated, damaging the reconstruction quality. Our ablation study empirically demonstrates the importance of the proposed shape prior and shows a clear advantage of the proposed method over these works.
2 Related Works
3D understanding of object shapes from images is considered an important step in scene perception. While mapping an environment using Structure from Motion [46, 3] and SLAM [14, 38] facilitates localization and navigation, a higher level of understanding about objects, in terms of their shape and relative position with respect to the rest of the background, is essential for their manipulation. Initial works that localize objects while estimating their pose from an image are limited to the case where a pre-scanned object structure is available [41, 18, 42].
It was soon realized that the valid 3D shapes of objects belonging to a specific category are highly correlated, and dimensionality reduction emerged as a prominent tool to model object shapes. Works like [49] reconstruct traditional image segmentation datasets like PASCAL VOC [16] by extending the ideas of non-rigid Structure from Motion [7, 6, 11, 19]. Methods like [31] propose to use these learned category-specific object shape manifolds to reconstruct the shape of an object from a single image by fitting the reconstructed shape to a single image silhouette, refining it with simple image-based methods like shape from shading [28].
While the methods mentioned above attempt to learn linear object shape manifolds from 2D keypoint correspondences or object segmentation annotations, the increasing availability of class-specific 3D CAD models and object scans allows researchers to learn complex manifolds for shapes within an object category directly from data. For example, the Gaussian Process Latent Variable Model or kernel/plain Principal Component Analysis is employed to learn a compact latent space of object shapes in [12, 40, 13, 15]. Taking advantage of modern deep learning techniques, deep autoencoders [27, 50], VAEs [34], and GANs [21, 53] can be better alternatives for modeling object shapes than traditional dimensionality reduction tools.
Recently, with the advance of deep learning, many end-to-end systems have been proposed to solve the object shape reconstruction problem directly from a single RGB image. Different representations are adopted to represent 3D shapes in deep-learning-based methods. Volumetric representations dominate the early works [10, 20, 24, 48, 51] because they are considered a natural extension of dense 2D segmentation: 3D deconvolutions replace the traditional 2D convolutions of the CNNs used for image segmentation. Octrees [25, 45] and the Discrete Cosine Transform [30] have been proposed to reduce the memory requirements of predicting a high-resolution volumetric grid. Other representations such as meshes [32, 57], multi-view depth maps [44, 36, 35], and point clouds [17, 29] are also explored. It is shown in [17, 29] that the point cloud representation is more efficient than the volumetric representation and simpler than meshes and octrees in terms of designing a suitable network architecture and training loss.
Another crucial aspect that affects the performance of a deep learning solution for shape reconstruction is the role of object pose. One stream of research [17, 51] proposes to incorporate pose implicitly into the shape reconstruction network, where the predicted shape is aligned with the input image. In another stream, it is shown in [58, 43] that decoupling shape and pose while reasoning about object structure from a single image has performance benefits. For instance, a canonical shape embedding in a low-dimensional space is employed in [20, 53]. Besides the fully supervised training used in [20, 10, 17], researchers have also explored weakly supervised training of the network without an explicit 3D loss. For instance, in Yan et al. [56] and Tulsiani et al. [48], a 3D shape generated by the network is supervised by its projections at multiple viewpoints with the corresponding ground-truth silhouettes. In particular, Zhu et al. [58] mitigate the domain gap between the synthetic images used in training and real images at test time by fine-tuning their network on a small amount of real images with ground-truth silhouette annotations. This weakly supervised training loss shares similarity with our silhouette-shape constraint during inference.
Following best practices in the deep learning regime, we choose to use a point cloud as the shape representation, decouple shape and pose by reconstructing 3D shapes in a canonical pose and estimating object pose separately, and adopt the two-step training schedule of the TL-embedding network [20], where a latent space for 3D shapes is learned explicitly via an autoencoder and an image regressor that maps an RGB image to the learned latent space is then trained. On top of these deep learning components, we use the predicted silhouette and the learned shape prior to transform the one-shot prediction of the deep networks into an optimization at inference time.
Similar to our framework, CodeSLAM [5], MarrNet [51], and Zhu et al. [59] also use a low-dimensional latent space to represent geometry. For scene-level geometry reconstruction, CodeSLAM uses nearby views to form a photometric loss, which is minimized by searching in a learned latent space of depth maps. MarrNet [52] and Zhu et al. [59] enforce a 2.5D (silhouette, depth, or normal)-to-3D shape constraint and a photometric loss, respectively, to search in a latent space of object shapes. Nevertheless, a common disadvantage of these works, which we address here, is that they do not explicitly consider a probabilistic prior on the latent space when they perform optimization; our ablation study shows that such a prior is essential for the optimization at inference time.
3 Method
The goal of this work is to reconstruct the 3D shape of an object and its pose with respect to the camera from a single image. Our framework closely follows traditional object-prior-based 3D reconstruction methods by framing reconstruction as an optimization problem with respect to the object shape and pose, assuming that a deep neural network provides us with the object silhouette. This inference method is described in Section 3.1. To facilitate the inference, we require a low-dimensional latent space which encodes the object shape, a learned probabilistic prior on this latent space providing the likelihood of any point in that space, a single-view object silhouette prediction to guide the shape and pose optimization, and initializations for the object shape and pose. How we learn these prerequisites is elaborated in Section 3.2.
3.1 Inferenceasoptimization
For inference, we are given sampled points S from the predicted silhouette, an initial latent code z_0 of the object shape, and an initial pose (in terms of a rotation R_0 and a translation t_0), each predicted using neural nets. Additionally, the learned probabilistic prior on the latent space is available in the form of a GMM.
We minimize the objective function defined below with respect to the shape and pose variables z, R, and t:
E(z, R, t) = E_sil(z, R, t) + λ E_prior(z)    (1)
Here E_sil corresponds to the silhouette-shape constraint and E_prior is the term related to the probabilistic shape prior; we explain both in the rest of this section. λ is the weighting factor between these two loss terms.
Silhouette-shape Constraint Term. This term enforces the projection of the transformed 3D shape to be consistent with the object's silhouette in the image. With a point cloud representation, this can be achieved by defining a 2D Chamfer loss between the projection of the transformed 3D point cloud and the sampled points of the predicted silhouette, as shown in Equation (2),
E_sil(z, R, t) = Σ_{x ∈ S} min_{p ∈ G(z)} ||x − π(T p)||² + Σ_{p ∈ G(z)} min_{x ∈ S} ||π(T p) − x||²    (2)
where G(z) is the point cloud reconstruction in a predefined canonical pose generated from a latent code z by the point cloud decoder G, T = K [R | t] is the transformation matrix composed of the camera intrinsic matrix K, the object rotation R, and its translation t, and π(·) denotes the perspective projection onto the image plane. S is the set of 2D sampled points from the predicted silhouette.
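To make the constraint concrete, the following is a minimal NumPy sketch of the silhouette-shape term; the function names, the pinhole camera model, and the toy intrinsics used in the test are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project canonical 3D points into the image: pi(K (R p + t))."""
    cam = points @ R.T + t              # points in the camera frame, (N, 3)
    uv = cam @ K.T                      # apply the intrinsic matrix K
    return uv[:, :2] / uv[:, 2:3]       # perspective division pi(.)

def silhouette_chamfer(points3d, sil2d, K, R, t):
    """Symmetric 2D Chamfer loss between the projected point cloud G(z)
    and the sampled silhouette points S, as in Equation (2)."""
    proj = project_points(points3d, K, R, t)                           # (N, 2)
    d2 = np.sum((proj[:, None, :] - sil2d[None, :, :]) ** 2, axis=-1)  # (N, M)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```

In the full system `points3d` would be the decoded cloud G(z); here any (N, 3) array works.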
Shape Prior Term. Because of the highly unconstrained nature of the silhouette-shape constraint and possible errors in the silhouette, we employ a shape prior term which quantifies the likelihood of a latent code using a probabilistic model. This mitigates the problem of generating unrealistic 3D shapes. Specifically, the shape prior minimizes the negative log-likelihood of the latent code under the GMM, which can be written as:
E_prior(z) = −log Σ_{k=1}^{M} w_k N(z; μ_k, Σ_k)    (3)
where w_k, μ_k, and Σ_k are the weight, mean, and covariance matrix of the k-th Gaussian component, and M is the number of Gaussian components.
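A direct NumPy evaluation of Equation (3) might look as follows; the dense-covariance Gaussian density is written out explicitly, and the parameter layout (lists of weights, means, covariances) is an assumption for illustration.

```python
import numpy as np

def gmm_neg_log_likelihood(z, weights, means, covs):
    """E_prior(z): negative log-likelihood of a latent code z under a GMM."""
    d = z.shape[0]
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = z - mu
        # Multivariate Gaussian density N(z; mu, cov)
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
        total += w * norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
    return -np.log(total)
```

In practice one would use a log-sum-exp formulation for numerical stability when the code is far from all components.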
3.2 Learning
As explained in Section 3.1, our inference method requires learning a probabilistic low-dimensional representation of object shape from data, together with three networks predicting the object silhouette, the initial object pose, and the initial object shape in the form of a latent code. In this section we describe how we learn these components.
3.2.1 Learning Shape Latent Space and Prior
Our inference method relies heavily on a learned low-dimensional latent space of shapes, from which a decoder network G maps a latent code z to a unique 3D point cloud G(z). To learn this mapping, we train an encoder-decoder network. As in a conventional autoencoder, our encoder E maps a point cloud P to its corresponding low-dimensional latent code z = E(P), whereas the decoder G reconstructs the point cloud from the latent code. We train this autoencoder using the 3D Chamfer loss defined below:
L_CD(P, P̂) = Σ_{x ∈ P} min_{y ∈ P̂} ||x − y||² + Σ_{y ∈ P̂} min_{x ∈ P} ||x − y||²,  where P̂ = G(E(P))    (4)
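The 3D Chamfer loss of Equation (4) has the same structure as its 2D counterpart; a brute-force NumPy sketch (fine for small clouds; real training code would use a batched GPU implementation):

```python
import numpy as np

def chamfer_3d(p, q):
    """Symmetric 3D Chamfer distance between point clouds p and q (Equation 4)."""
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```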
To learn a probabilistic model as the shape prior, we fit a GMM to the learned latent space. Specifically, we map the point clouds of the training set to the latent space using the point cloud encoder E, obtaining a set of latent codes {z_i}_{i=1}^{N}. The parameters of a GMM with M components are then learned to maximize the likelihood of the training set's latent codes, using an EM algorithm to optimize the following objective:
max_{{w_k, μ_k, Σ_k}} Σ_{i=1}^{N} log Σ_{k=1}^{M} w_k N(z_i; μ_k, Σ_k)    (5)
where w_k, μ_k, and Σ_k are the weight, mean, and covariance matrix of the k-th Gaussian component, and N and M are the number of sample latent codes and the number of Gaussian components, respectively.
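Fitting the prior amounts to standard EM on the collected codes; here is a sketch using scikit-learn's GaussianMixture on stand-in 2-D "latent codes" (the two-cluster toy data and the 2-D dimensionality are assumptions; in the paper the codes come from the point cloud encoder):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in latent codes: two well-separated clusters in a toy 2-D latent space.
codes = np.vstack([rng.normal(-5.0, 0.5, (200, 2)),
                   rng.normal(5.0, 0.5, (200, 2))])

# EM maximizes the data log-likelihood of Equation (5).
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(codes)

# A code near a cluster centre is far more likely than one from the gap
# between clusters, which is exactly what E_prior penalizes.
ll_real = gmm.score_samples(np.array([[-5.0, -5.0]]))[0]
ll_gap = gmm.score_samples(np.array([[0.0, 0.0]]))[0]
```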
There are a number of generative models that could serve as a shape prior. We deploy a GMM with a simple autoencoder following the recent findings of [2], which empirically show that, for point clouds, the GMM outperforms GAN variants in the generalization and diversity of generated shapes, meaning that the GMM captures a better distribution of object shapes.
3.2.2 Learning segmentation and initialization
Our inference requires three networks, each taking as input an RGB image containing the object of interest. These networks are: (i) 3D-net: predicts the latent code corresponding to an image, which is used to initialize the object shape via the decoder G. (ii) KP (KeyPoint)-net: predicts the projected locations of the 8 corner points of the object's 3D bounding box, which are used to recover the initial object pose R_0 and t_0 with respect to the camera. (iii) Sil-net: predicts the object's silhouette in the image, which guides the inference. In this section we outline the procedure adopted in this work to train these components.
3D-net. This network consists of two parts. An image regressor learns to map an image to a latent code, which is decoded by the learned decoder G introduced in Section 3.2.1 to generate a 3D point cloud. During training, when the 3D point cloud corresponding to the image is available, the image regressor is trained by minimizing the error between its prediction and the output of the trained encoder E given the ground-truth 3D point cloud. During inference, the image regressor maps the input image to the latent code z_0, which is used as the initialization of the object shape in the latent space.
KP-net. We define the eight corner points of the object's 3D bounding box in the predefined canonical pose as 3D keypoints. The KP-net learns to predict the projected locations of these eight points on the image plane, using the ground-truth object pose available at training time. An L2 loss between the ground-truth 2D projections of the 3D keypoints and the prediction is employed to supervise the training.
The 2D-3D keypoint correspondences are used to recover the object pose with respect to the camera, which serves as the initialization for inference (Section 3.1). More specifically, at inference time we retrieve the 3D corner points from the point cloud reconstruction of the 3D-net and pair them with the KP-net predictions to form the 2D-3D correspondences. Object pose estimation is then formulated as a PnP problem, as shown in Equation (6), and solved iteratively using the Levenberg-Marquardt algorithm [37].
(R_0, t_0) = argmin_{R, t} Σ_{i=1}^{8} ||k_i − π(K (R X_i + t))||²    (6)

where X_i are the 3D corner points of the reconstruction and k_i are the corresponding 2D keypoint predictions of the KP-net.
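A minimal version of this PnP step can be written with SciPy's Levenberg-Marquardt solver; the Euler-angle parameterization and pinhole projection mirror the paper's setup, while the helper names and the toy intrinsics in the test are assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def euler_to_rot(a):
    """Rotation matrix from Euler angles (an x-y-z convention is assumed here)."""
    ax, ay, az = a
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def reproj_residuals(params, pts3d, pts2d, K):
    """Per-point reprojection errors k_i - pi(K (R X_i + t)) of Equation (6)."""
    R, t = euler_to_rot(params[:3]), params[3:]
    uv = (pts3d @ R.T + t) @ K.T
    return (uv[:, :2] / uv[:, 2:3] - pts2d).ravel()

def solve_pnp(pts3d, pts2d, K, x0):
    """Iterative PnP: Levenberg-Marquardt on [euler_angles, t] from an initial guess."""
    return least_squares(reproj_residuals, x0, args=(pts3d, pts2d, K), method='lm').x
```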
Sil-net. A straightforward U-net architecture is trained to map an RGB image to the corresponding object silhouette, from which we sample the points S used in the silhouette-shape constraint during inference. The network is learned in a fully supervised manner with binary cross-entropy as the loss function. Instead of training the Sil-net from scratch, our framework can also use an off-the-shelf semantic segmentation network [26, 9] or fine-tune from such a network.
4 Experiments
We provide experimental results of our framework on single-view object shape reconstruction and pose estimation on a real dataset and compare them against prior art. A series of ablation studies is also presented to analyze the effect of our inference procedure and the novel probabilistic shape prior.
4.1 Implementation Details
Data to train all networks are generated from the ShapeNet dataset [8]. We render synthetic images of the textured 3D CAD models from random viewpoints and overlay random backgrounds on the rendered objects to simulate the background clutter present in real images. Background images are taken from the SUN dataset [55]. Random flips, crops, and color augmentations are used during training. To obtain the ground-truth 2D keypoint locations for training the KP-net, we use the 3D coordinates of the corner points of the ground-truth shape's bounding box, transformed and projected onto the image plane using the object pose with which the image was rendered. The point cloud autoencoder is trained for 400 epochs, the image regressor for 25 epochs, and the Sil-net and KP-net end-to-end for 60 and 25 epochs respectively. The Adam optimizer [33] is used to train all of these networks. Details of the network architectures are available in the supplementary material.
For inference, we parameterize the rotation with Euler angles because we empirically find this representation easier to optimize than quaternions or Lie-group parameterizations. The 6-DoF object pose and the shape latent code are optimized jointly using the Adam optimizer in TensorFlow [1]. It is important to note that during gradient descent we only update the latent code corresponding to the inferred object shape, not the decoder network's weights. The optimization is terminated when the rate of change of the loss function diminishes or a maximum of 2000 iterations is reached.
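Putting the pieces together, the inference loop can be sketched end-to-end with a toy one-parameter "decoder" (a latent scale applied to a template cloud) standing in for G, a single Gaussian standing in for the GMM prior, and a generic SciPy optimizer standing in for Adam; everything here other than the E_sil + λ E_prior structure is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import minimize

K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
# Canonical template: the 8 corners of a unit cube.
template = np.array([[x, y, w] for x in (-1.0, 1.0)
                     for y in (-1.0, 1.0) for w in (-1.0, 1.0)])

def decode(z):
    # Toy stand-in for the decoder G: the latent code scales the template.
    return z * template

def project(pts, t):
    uv = (pts + t) @ K.T          # rotation fixed to identity for brevity
    return uv[:, :2] / uv[:, 2:3]

def chamfer2d(a, b):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

# "Predicted silhouette" points, sampled from a ground-truth shape and pose.
sil = project(decode(1.2), np.array([0.0, 0.0, 6.0]))

def energy(params, lam=0.1):
    z, t = params[0], params[1:]
    e_sil = chamfer2d(project(decode(z), t), sil)   # Equation (2)
    e_prior = 0.5 * (z - 1.0) ** 2 / 0.04           # single-Gaussian stand-in for Eq. (3)
    return e_sil + lam * e_prior                    # Equation (1)

x0 = np.array([0.8, 0.1, 0.0, 5.5])                 # network-provided initialization
res = minimize(energy, x0, method='Nelder-Mead',
               options={'maxiter': 2000, 'fatol': 1e-12, 'xatol': 1e-9})
```

The refined parameters `res.x` play the role of the optimized latent code and pose; the real system differentiates through the decoder instead of using a derivative-free method.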
Table 1: Shape reconstruction on Pix3D (chair category); lower CD and EMD are better.

Methods | CD | EMD
3D-R2N2 | |
PSGN | |
3D-VAE-GAN | |
DRC | |
MarrNet | |
AtlasNet | |
Pix3D | |
Ours (baseline) | |
Ours | |
4.2 Testing Data and Evaluation Protocol
We take advantage of the recently published Pix3D [43] dataset – a dataset consisting of diverse shapes aligned with corresponding real images. We choose examples of the chair class to demonstrate our method, for two reasons. Firstly, the chair class is a major component of this dataset and exhibits the most shape diversity among all classes. There are 2894 chair images corresponding to 221 unique chair shapes, which greatly outnumbers the other object classes. For instance, table, the second largest class, has only 738 images corresponding to 63 shapes. Secondly, the authors of Pix3D publish benchmark results for most of the state-of-the-art methods on the chair class, which facilitates convenient comparison of our work with prior art.
Other popular alternatives to Pix3D were either purely synthetic datasets like ShapeNet [8], or PASCAL 3D+ [54], which was created by associating a small number of overlapping 3D CAD models per category with both training and test images of a particular object category to create pseudo ground truth. For this reason, as discussed in [48, 51], PASCAL 3D+ has not recently been deemed fit for shape reconstruction evaluation.
We use the 3D Chamfer Distance (CD, as defined in (4)) and the Earth Mover's Distance (EMD, as defined in (7)) as evaluation measures for shape reconstruction, following the human study on evaluation metrics presented in Pix3D, which shows that these metrics correlate better with human perception than IoU.
d_EMD(P_1, P_2) = min_{φ: P_1 → P_2} Σ_{x ∈ P_1} ||x − φ(x)||_2    (7)
where φ is a bijection. We use the approximation implemented in the Pix3D benchmark setup to calculate the EMD efficiently.
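For equal-size point sets, the exact EMD of Equation (7) is an optimal assignment problem; here is a sketch using SciPy's Hungarian solver (the Pix3D benchmark uses a faster approximation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(p1, p2):
    """Earth Mover's Distance (Equation 7): cost of the optimal bijection phi."""
    cost = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return cost[rows, cols].sum()
```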
Following prior art, we evaluate pose using MedErr, the median of the geodesic distance between the predicted rotation and the ground-truth rotation defined in Equation (8), and Acc_{π/6}, the percentage of examples whose geodesic distance is smaller than π/6.
Δ(R_pred, R_gt) = ||log(R_pred^T R_gt)||_F / √2    (8)
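Since ||log(R_1^T R_2)||_F = √2 · θ, where θ is the angle of the relative rotation, the metric of Equation (8) reduces to that angle, which can be computed from the trace:

```python
import numpy as np

def geodesic_distance(R_pred, R_gt):
    """||log(R_pred^T R_gt)||_F / sqrt(2): the relative rotation angle (Equation 8)."""
    rel = R_pred.T @ R_gt
    cos_theta = np.clip((np.trace(rel) - 1.0) / 2.0, -1.0, 1.0)  # clip for numerical safety
    return np.arccos(cos_theta)
```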
Table 2: Ablation study on the shape prior.

Methods | Shape (CD / EMD) | Pose (MedErr / Acc_{π/6})
baseline | |
E_sil only | |
E_sil + E_prior | |
4.3 Shape Reconstruction on Pix3D
We compare the performance of our 3D shape prediction pipeline with a list of prior art including 3D-VAE-GAN [53], 3D-R2N2 [10], PSGN [17], DRC [48], AtlasNet [23], MarrNet [51], and Pix3D [52], a variant of MarrNet that decouples shape and pose prediction. All methods use ShapeNet for training their respective structure prediction frameworks. The quantitative results are reported in Table 1. The performance figures for most of the methods listed above are taken from the Pix3D paper [43].
Ours (baseline) in Table 1 refers to the one-shot deep network prediction without the inference-time optimization outlined in Section 3.1. It is notable that our baseline already outperforms the prior art. We believe that this performance boost is due to a unique combination of best practices carefully picked from prior art to design our one-shot prediction baseline. For instance, our method predicts a canonical point cloud to represent shape; other methods either opt for an in-pose point cloud representation [17], i.e., coupling the shape and pose, or work with volumetric or mesh representations of object shapes. In addition, similar to the TL-embedding network [20] and 3D-VAE-GAN [53], we learn a latent space of 3D shapes explicitly instead of a direct 2D-to-3D mapping.
More importantly, the most significant point we want to emphasize from the results in Table 1 is that our probabilistic shape-prior-aware optimization of shape and pose outperforms this very strong feed-forward baseline (along with the other state-of-the-art methods) by a significant margin. Qualitative examples of our in-pose reconstructions can be found in the supplementary material.
4.4 Ablation Study on Shape prior
In this section, we analyze the effect of the different components used in our shape and pose inference qualitatively (Figure 3) and quantitatively (Table 2). In particular, we demonstrate that the proposed probabilistic shape prior plays a crucial part in shape and pose estimation. To that end, we compare the one-shot shape prediction used as the baseline with a simple optimization of pose and shape using the silhouette only, i.e., minimizing E_sil, and with our full inference method, which also takes into account the likelihood of the generated shape by using both E_sil and E_prior.
Figure 3 visualizes the input images, predicted silhouettes, and shape reconstructions given by the different inference schemes. Optimization using both the shape prior and the silhouette-shape constraint brings the reconstruction closer to the ground-truth shape than one-shot prediction. Notably, if the shape prior is discarded, optimizing with E_sil alone is likely to generate unnatural shapes, such as an uneven chair seat, chair arms of different heights, and unrealistically short chair legs, as highlighted by the red boxes in the figure. These unrealistic reconstructions are in part overfitted to noisy object silhouettes.
As for the quantitative comparison in Table 2, it can be seen that while minimizing the silhouette-shape constraint alone leads to an improvement in shape estimation in terms of CD, the EMD measure suggests that the estimated shape is inferior to the baseline. This perhaps reflects the fact that the estimated reconstruction overfits to the CD metric. However, when we minimize the full objective function, which takes into account the likelihood of the learned latent embedding of shapes, the reconstruction performance improves significantly on both measures, along with a reasonable boost in pose estimation. To put this performance improvement in perspective, we also use the ground-truth silhouette annotations available in Pix3D to estimate pose and shape with the silhouette-shape constraint only, and evaluate the reconstruction performance to establish an upper bound for 3D shape inference from a single silhouette.
Further, in Figure 5 we analyze this performance gap in reconstruction as the accuracy of the predicted silhouette changes. We plot the median CD and EMD (in the left and right parts of the figure, respectively) against different accuracy levels of the silhouette predictions, quantified in terms of mean IoU. It is clear that the shape prior helps significantly more in cases where the estimated silhouette has gross errors.
Our contention is that the learned shape prior restricts the inferred shape to realistic, object-like solutions by steering the optimization away from low-likelihood regions of the latent space – thereby avoiding overfitting to noisy silhouettes. To further explore this hypothesis, we choose three points in the latent space corresponding to the means of different Gaussian components and interpolate linearly between them to generate intermediate sample chair shapes. The results are visualized in Figure 4. It can be clearly seen that the latent space learned by deep autoencoder training contains regions corresponding to shapes that are not at all chair-like. However, these non-chair-like shapes usually correspond to high-variance regions of the learned GMM – for instance, between the latent code of a four-legged chair (mean_0) and that of a swivel chair (mean_1), the likelihood of an unrealistic shape mixing the two structures is particularly low.
4.5 Pose Estimation using Keypoints
In terms of pose estimation, we compare our keypoint-based approach with approaches that directly predict the rotation parameters, either by classification or by regression. For a fair comparison, we use the same training data and similar network architectures for all approaches, differing only in the last output layer because of the different output dimensions. Our keypoint-based method outperforms both the regression and classification approaches, as shown in Table 3. We are the first to show that a keypoint-based method can also work with shape reconstruction, without explicitly predicting 3D keypoints, and can do better than classification or regression. This result is in line with other keypoint-based methods [22, 52] that outperform direct prediction methods.
Table 3: Pose estimation with different rotation prediction schemes.

Methods | MedErr | Acc_{π/6}
Regression | |
Classification | |
Keypoint | |
5 Discussion
Inspired by traditional shape-prior-based reconstruction methods for a single view, where the silhouette prediction and the 3D shape reconstruction are separated, we have presented a deep-learning-based framework that replaces/enhances one-shot deep network prediction with iterative optimization. Our inference method not only reasons about the silhouette-shape constraint, but also uses a probabilistic shape prior learned on the latent shape space to avoid unrealistic reconstructions, achieving state-of-the-art performance on a large dataset of real images.
The inference-as-optimization framework with an explicit probabilistic shape prior can easily be extended to exploit information from multiple views. This can be achieved simply by changing the inference loss to incorporate multiple silhouette consistency terms and, additionally, multi-view photometric consistency. Furthermore, frameworks like CodeSLAM, which use a latent code representation to facilitate dense bundle adjustment, would potentially benefit from a probabilistic prior such as the one we have proposed.
References
 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [2] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. In International Conference on Machine Learning, 2018.
 [3] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building rome in a day. In Computer Vision, 2009 IEEE 12th International Conference on, pages 72–79. IEEE, 2009.
 [4] S. Y. Bao, M. Chandraker, Y. Lin, and S. Savarese. Dense object reconstruction with semantic priors. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1264–1271, June 2013.
 [5] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison. CodeSLAM: Learning a compact, optimisable representation for dense visual SLAM. arXiv preprint arXiv:1804.00874, 2018.
 [6] W. Brand. Morphable 3d models from video. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2001.
 [7] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 690–696. IEEE, 2000.
 [8] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
 [9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
 [10] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pages 628–644. Springer, 2016.
 [11] Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. International Journal of Computer Vision, 107(2):101–122, 2014.
 [12] S. Dambreville, Y. Rathi, and A. Tannenbaum. A framework for image segmentation using shape models and kernel space shape priors. IEEE transactions on pattern analysis and machine intelligence, 30(8):1385–1399, 2008.
 [13] A. Dame, V. A. Prisacariu, C. Y. Ren, and I. Reid. Dense reconstruction using 3d object shape priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1288–1295, 2013.
 [14] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):1052–1067, 2007.
 [15] F. Engelmann, J. Stückler, and B. Leibe. Joint object pose estimation and shape reconstruction in urban street scenes using 3d shape priors. In German Conference on Pattern Recognition, pages 219–230. Springer, 2016.
 [16] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International journal of computer vision, 88(2):303–338, 2010.
 [17] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.
 [18] J. Gall, B. Rosenhahn, and H.P. Seidel. Clustered stochastic optimization for object recognition and pose estimation. In Joint Pattern Recognition Symposium, pages 32–41. Springer, 2007.
 [19] R. Garg, A. Roussos, and L. Agapito. Dense variational reconstruction of non-rigid surfaces from monocular video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1272–1279, 2013.
 [20] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.
 [21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [22] A. Grabner, P. M. Roth, and V. Lepetit. 3d pose estimation and 3d model retrieval for objects in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3022–3031, 2018.
 [23] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. AtlasNet: A papier-mâché approach to learning 3d surface generation. arXiv preprint arXiv:1802.05384, 2018.
 [24] J. Gwak, C. B. Choy, M. Chandraker, A. Garg, and S. Savarese. Weakly supervised 3d reconstruction with adversarial constraint. In 3D Vision (3DV), 2017 International Conference on, pages 263–272. IEEE, 2017.
 [25] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. In 3D Vision (3DV), 2017 International Conference on, pages 412–420. IEEE, 2017.
 [26] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
 [27] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
 [28] B. K. Horn. Obtaining shape from shading information. MIT press, 1989.
 [29] E. Insafutdinov and A. Dosovitskiy. Unsupervised learning of shape and pose with differentiable point clouds. arXiv preprint arXiv:1810.09381, 2018.
 [30] A. Johnston, R. Garg, G. Carneiro, I. D. Reid, and A. van den Hengel. Scaling CNNs for high resolution volumetric reconstruction from a single image. In ICCV Workshops, pages 930–939, 2017.
 [31] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1966–1974, 2015.
 [32] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.
 [33] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [34] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [35] K. Li, T. Pham, H. Zhan, and I. Reid. Efficient dense point cloud object reconstruction using deformation vector fields. In Proceedings of the European Conference on Computer Vision (ECCV), pages 497–513, 2018.
 [36] C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. arXiv preprint arXiv:1706.07036, 2017.
 [37] J. J. Moré. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical analysis, pages 105–116. Springer, 1978.
 [38] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2320–2327. IEEE, 2011.
 [39] V. A. Prisacariu and I. Reid. Shared shape spaces. In Proceedings of the 2011 International Conference on Computer Vision, pages 2587–2594. IEEE Computer Society, 2011.
 [40] V. A. Prisacariu, A. V. Segal, and I. Reid. Simultaneous monocular 2d segmentation, 3d pose recovery and 3d reconstruction. In Asian conference on computer vision, pages 593–606. Springer, 2012.
 [41] B. Rosenhahn, T. Brox, and J. Weickert. Three-dimensional shape knowledge for joint image segmentation and pose tracking. International Journal of Computer Vision, 73(3):243–262, 2007.
 [42] C. Schmaltz, B. Rosenhahn, T. Brox, J. Weickert, D. Cremers, L. Wietzke, and G. Sommer. Occlusion modeling by tracking multiple objects. In Joint Pattern Recognition Symposium, pages 173–183. Springer, 2007.
 [43] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman. Pix3D: Dataset and methods for single-image 3d shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2974–2983, 2018.
 [44] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016.
 [45] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), volume 2, page 8, 2017.
 [46] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.
 [47] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. Computer Vision and Pattern Recognition (CVPR), 2018.
 [48] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, volume 1, page 3, 2017.
 [49] S. Vicente, J. Carreira, L. Agapito, and J. Batista. Reconstructing PASCAL VOC. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 41–48, 2014.
 [50] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
 [51] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. MarrNet: 3d shape reconstruction via 2.5D sketches. In Advances in neural information processing systems, pages 540–550, 2017.
 [52] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. 3d interpreter networks for viewer-centered wireframe modeling. International Journal of Computer Vision, pages 1–18, 2018.
 [53] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
 [54] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3d object detection in the wild. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 75–82. IEEE, 2014.
 [55] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva. SUN database: Exploring a large collection of scene categories. International Journal of Computer Vision, 119(1):3–22, 2016.
 [56] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning singleview 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016.
 [57] Y. Yang, C. Feng, Y. Shen, and D. Tian. FoldingNet: Interpretable unsupervised learning on 3d point clouds. arXiv preprint arXiv:1712.07262, 2017.
 [58] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 57–65. IEEE, 2017.
 [59] R. Zhu, C. Wang, C.-H. Lin, Z. Wang, and S. Lucey. Object-centric photometric bundle adjustment with deep shape prior. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 894–902. IEEE, 2018.