3D Shape Reconstruction from a Single 2D Image via 2D-3D Self-Consistency
Abstract
Aiming at inferring 3D shapes from 2D images, 3D shape reconstruction has drawn considerable attention from researchers in the computer vision and deep learning communities. However, it is not practical to assume that 2D input images and their associated ground truth 3D shapes are always available during training. In this paper, we propose a framework for semi-supervised 3D reconstruction. This is realized by our introduced 2D-3D self-consistency, which aligns the predicted 3D models and the projected 2D foreground segmentation masks. Moreover, our model not only recovers 3D shapes with the corresponding 2D masks, but also jointly disentangles and predicts camera pose information, even though such supervision is never available during training. In the experiments, we qualitatively and quantitatively demonstrate the effectiveness of our model, which performs favorably against state-of-the-art approaches in both supervised and semi-supervised settings.
1 Introduction
3D modeling and reconstruction can be applied to a variety of real-world applications, including visual rendering, modeling, and robotics. While it might not be difficult for humans to infer 3D information from observed 2D visual data, it is a very challenging task for machines. Without sufficient 3D shapes, viewpoint information, or 2D images from different viewpoints as training data, it is difficult to reconstruct 3D models from 2D data.
Over the past few years, convolutional neural networks (CNNs) and generative adversarial networks (GANs) [9] have shown impressive progress and results, particularly in the areas of computer vision and image processing. For 3D shape reconstruction, several solutions have been proposed [8, 4, 5, 17, 1, 26, 13, 29, 7, 11, 21, 22, 28, 14, 24, 19, 30, 6, 27]. As discussed later in Section 2, different settings and limitations, such as the required number of 2D input images, the availability of viewpoint information, and the supervision of ground truth labels, limit the use of existing models in 3D reconstruction applications.
In this paper, we address a challenging task of 3D model reconstruction from a single 2D image. That is, during training and testing, we only allow our model to observe a single 2D image as the input, without knowing its viewpoint information (e.g., azimuth, elevation, and roll). If full label supervision is available, we allow both the 3D ground truth shape (in voxels) and the associated 2D foreground segmentation mask to guide the learning of our proposed model. As for unlabeled data during the semi-supervised learning process, we only observe 2D images for training our model.
To handle the aforementioned challenging setting, we propose a deep learning architecture which not only recovers 3D shape and 2D mask outputs with full supervision, but also exploits their 2D-3D self-consistency during the learning process. This self-consistency is the reason why we are able to utilize unlabeled image inputs to realize semi-supervised learning. Finally, as a special characteristic of our model, we are able to jointly disentangle camera pose information during the above process, even though no ground truth pose information is observed at any time during training.
The contributions of this work are highlighted below:

A unique semi-supervised deep learning framework is proposed for 3D shape prediction from a single 2D image.

The presented network is trained in an end-to-end fashion, while 2D-3D self-consistency is particularly introduced to handle unlabeled training image data.

In addition to 3D shape prediction, our model is able to disentangle camera pose information from the derived latent feature space, while supervision of such information is never available during training.

Experimental results quantitatively and qualitatively show that our method performs favorably against state-of-the-art fully supervised and weakly-supervised methods.
2 Related Work
2.1 Learning for 3D Model Prediction
Most existing methods for 3D shape reconstruction require supervised settings, i.e., both 2D input images and their corresponding 3D models need to be observed during the training stage. With the development of large-scale shape repositories like ShapeNet [2], several methods of this category have been proposed [8, 4]. For example, the work of Girdhar et al. [8] is fully trained with paired 2D images and 3D models, which is realized by learning a joint embedding for both 3D shapes and 2D images. On the other hand, Wang et al. [27] choose to predict the 3D shape, 2D mask, and pose simultaneously, followed by the construction of probabilistic visual hulls. However, they require additional ground truth pose information to train their model.
Since it might not be practical to assume that the ground truth 3D object information is always available, recent approaches like [29, 11, 28, 22, 24] manage to learn object representations in weakly-supervised settings. For example, Yan et al. [29] present Perspective Transformer Nets. Guided by the projected 2D masks and given camera viewpoints, their proposed network architecture learns the perspective transformations of the target 3D object. As a result, the 3D objects can be recovered using 2D images without the supervision of ground truth shape information. Alternatively, Gwak et al. [11] take 2D images and viewpoint information as input, and utilize 2D masks as weak supervision information, which is enabled by perspective projection of reconstructed 3D shapes to foreground masks. They constrain the reconstructed 3D shapes to a manifold observed by unlabeled shapes of real-world objects. Soltani et al. [22] learn a generative model over depth maps or their corresponding silhouettes, and then use a deterministic rendering function to construct 3D shapes from the generated depth maps and silhouettes. Although their network is able to generate images at different viewpoints from a single input when testing, it requires both silhouettes and depth maps from different views as ground truth for training purposes. To reconstruct 3D shapes, Tulsiani et al. [24] choose to utilize the consistency between shape and viewpoint information independently predicted from two camera views of the same instance. However, they also require multiple images of the same instance taken from different camera viewpoints for training purposes.
Different from the works requiring images taken by different cameras for 3D reconstruction, our method only needs a single image for 3D shape reconstruction, and can be trained in a semi-supervised setting. Moreover, as noted and discussed in the following subsection, our model is able to disentangle the camera pose without the associated ground truth information.
2.2 Learning Interpretable Representation
Learning interpretable representations, or representation disentanglement, has attracted the attention of researchers in the fields of computer vision and machine learning. Kingma et al. [15] utilize variational autoencoders to handle the generality of objects with input noise for improved data distribution. However, there is no guarantee that particular attributes in the derived latent space would correspond to desirable features. When it comes to recovering 3D information from input 2D images, it is often desirable to separate external parameters like viewpoint during the learning of visual object representations, so that the output data can be properly recovered or manipulated.
For representation disentanglement in 3D shape reconstruction, some existing approaches such as Adversarial Autoencoders (AAE) [18] manage to derive disentangled representations by supervised learning, matching the derived representations with specific labels and adversarial losses calculated from the discriminators. Weakly-supervised learning methods have also been proposed, which alleviate the need for fully labeled training data in the above process. For example, the Deep Convolutional Inverse Graphics Network (DC-IGN) [16] clamps certain dimensions of representation vectors from a mini-batch of training instances for retrieving factors such as azimuth angle, elevation angle, azimuth of the light source, or intrinsic properties. To perform representation disentanglement in an unsupervised setting, Information Maximizing GAN (InfoGAN) [3] chooses to maximize the mutual information between latent variables and observations to learn disentangled representations.
Recently, Grant et al. [10] propose Deep Disentangled Representations for Volumetric Reconstruction, which takes 2D images as inputs and produces separate representations for the 3D shapes of objects and the parameters of viewpoint and lighting. Since their approach requires full data supervision for deriving the associated shape and transformation information (i.e., ground truth volumetric shapes are always available during training), it would not be easy to extend their work to practical scenarios where only a portion of the 2D image data come with ground truth 3D information. Sharing similar goals, we aim at identifying shape and viewpoint information in a semi-supervised setting, while no ground truth camera pose information is observed during training.
3 Pose-Aware 3D Shape Reconstruction via Representation Disentanglement
In this paper, we propose a unique deep learning framework which observes a single 2D image for 3D shape reconstruction. Our model not only can be extended from fully supervised to semi-supervised settings (i.e., only a portion of the 2D images come with ground truth 3D information), but also exhibits the ability to disentangle camera pose information from the derived representation. This disentanglement process is performed in an unsupervised fashion, and is achieved by observing 2D-3D self-consistency as we detail later.
In the remainder of this section, we first describe the notation and architecture in Section 3.1. How to advance representation disentanglement with 2D-3D self-consistency for 3D reconstruction is detailed in Section 3.2. Finally, Section 3.3 summarizes the learning process of our model.
3.1 Notations and Architecture
To reconstruct 3D shape information, we describe shapes in volumetric forms of probabilistic occupancy in this paper. For the sake of completeness, we now define the notations which will be used.
Let denote the training data, where indicates the input 2D image, and are the associated ground truth 3D voxel and 2D mask. Thus, our goal is to predict the 3D voxel (and the 2D mask ), while the camera pose information will be jointly predicted in the latent space . It is worth noting that, since we focus on a semisupervised setting, the data used for semisupervised learning are and , while the size of is considered to be larger than that of .
For the fully supervised version of our framework, every training input image has the corresponding ground truth 3D voxel and ground truth 2D mask. For the semi-supervised version, only a portion of the training images come with their 3D ground truth and 2D masks. Nevertheless, in both the supervised and semi-supervised settings, we never observe ground truth camera pose information during training, while jointly reconstructing the 3D shape and predicting the camera pose are our goals.
As shown in Figure 1, our proposed network architecture consists of four components:
Image Encoder. The encoder has a residual structure [12] and maps the input RGB image into an intrinsic shape representation (i.e., shape code) and an extrinsic viewpoint representation (i.e., pose code). The disentangled pose code is then converted into the camera pose through a fully connected (FC) layer. Since translation in 3D space does not affect the quality of shape reconstruction, we consider only elevation and azimuth for describing the camera pose.
Voxel Decoder. The goal of this decoder is to recover the 3D voxels based on the input shape code. Following [19], we apply 2D deconvolution layers as our voxel tube decoder. However, we choose to apply three separate 2D deconvolution operations, with the channel dimensions of the three deconvolution layers corresponding to height, width, and depth, respectively.
Mask Decoder. Different from the voxel decoder, the purpose of the mask decoder is to output the 2D mask by observing both the input pose code and shape code. We utilize a U-Net [20] based structure for this 2D foreground mask segmentation procedure.
Module for 2D-3D Self-Consistency. As a unique design in our network, this module observes the output 3D model and the 2D mask. More precisely, by taking the disentangled camera pose information from the pose code, this module enforces the consistency between the predicted 3D voxel and the corresponding 2D mask via the projection loss. As discussed in the following subsection, this is how we achieve disentanglement of the extrinsic camera pose from the intrinsic shape representation without observing any ground truth pose information. In other words, the introduction of this module realizes our pose-aware 3D shape reconstruction.
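To illustrate how a two-parameter camera pose (azimuth and elevation) can be turned into a rotation for the projection step, the following is a minimal sketch. The axis convention (azimuth about the vertical y-axis, elevation about the x-axis), the composition order, and the function name `rotation_from_pose` are assumptions made for this example; the text does not specify them.

```python
import numpy as np

def rotation_from_pose(azimuth_deg, elevation_deg):
    """Build a 3x3 rotation matrix from azimuth and elevation (assumed convention)."""
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    # Rotation about the y-axis by the azimuth angle.
    r_az = np.array([[np.cos(az), 0.0, np.sin(az)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(az), 0.0, np.cos(az)]])
    # Rotation about the x-axis by the elevation angle.
    r_el = np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(el), -np.sin(el)],
                     [0.0, np.sin(el), np.cos(el)]])
    return r_el @ r_az

R = rotation_from_pose(30.0, 20.0)  # hypothetical pose in degrees
```

Because the matrix is a smooth function of the two angles, gradients of a projection-based loss can in principle flow back to the predicted pose.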
3.2 Feature Disentanglement via 2D-3D Self-Consistency
Our proposed model is capable of deep feature disentanglement and pose-aware 3D reconstruction. It is worth repeating that, while our model can be trained in fully supervised or semi-supervised settings (i.e., observing ground truth voxels and 2D masks), ground truth camera pose information is never required.
As noted in Section 3.1, our image encoder maps input images into an intrinsic shape representation (or shape code) and an extrinsic viewpoint representation (or pose code). The former is fed into the voxel decoder to recover pose-invariant 3D voxels, while both the shape code and the pose code are the inputs to the mask decoder for 2D foreground mask segmentation. In order to convert the extracted pose code into an exact camera pose value (such as rotation and elevation), we transform the pose code through one FC layer into the camera pose and impose a 2D-3D self-consistency loss, aiming to align the predicted 3D voxels and the 2D mask with the disentangled pose code. We note that, as detailed below, the enforcement of 2D-3D self-consistency is the critical component for camera pose disentanglement and pose-aware 3D reconstruction. It is also the key to enabling semi-supervised learning for our proposed network.
2D-3D Self-Consistency. A differentiable ray consistency loss has been utilized in [24, 25], which evaluates the inconsistency between the mask and the shape viewed from the predicted camera pose. To achieve camera pose disentanglement, we use the predicted camera pose information and the corresponding voxel outputs to generate a projected 2D mask, and then calculate the difference between this projected 2D mask and the one predicted by the mask decoder. This aligns the 2D projection of the 3D voxels with the predicted mask using the disentangled camera pose.
To realize the above disentanglement process via camera pose alignment, we consider a unique 2D-3D self-consistency loss which integrates the projection loss and the ray consistency loss as follows:
\[ L_{sc} = \alpha_1 L_{ray} + \alpha_2 L_{proj}. \]
For the ray consistency loss, we consider a ray passing through the mask at location $(u, v)$ and traveling through the voxel grid (as illustrated in Fig. 2). We sample $N$ values along this ray, and the $i$-th sampled value represents the occupancy at that sample point. Next, for each sample point, the probability that the ray stops at that point is calculated. If the mask value at $(u, v)$ is 0, the probability that the ray penetrates all sample points should be close to 1. On the other hand, if the mask value at $(u, v)$ is 1, this probability should be close to 0.
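Now we introduce the details of the ray consistency loss. We consider the ray passing through the mask at location $(u, v)$. Given the camera intrinsic parameters $(f_u, f_v, u_0, v_0)$, where $(f_u, f_v)$ is the focal length and $(u_0, v_0)$ is the optical center of the camera, we can determine the direction of this ray as $(\frac{u - u_0}{f_u}, \frac{v - v_0}{f_v}, 1)$. We sample $N$ points along this ray, and the location of the $i$-th sample point in the camera coordinate frame is $(\frac{u - u_0}{f_u} d_i, \frac{v - v_0}{f_v} d_i, d_i)$, where $d_i$ denotes the depth of the sample and $i = 1, \dots, N$.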
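The ray geometry described above can be sketched as follows, assuming a standard pinhole camera. The parameter names (`fu`, `fv`, `u0`, `v0`), the depth range, and the function name `sample_ray_points` are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def sample_ray_points(u, v, fu, fv, u0, v0, depths):
    """Return (N, 3) camera-frame points along the ray through pixel (u, v)."""
    # Ray direction for a pinhole camera, normalized so that z = 1.
    direction = np.array([(u - u0) / fu, (v - v0) / fv, 1.0])
    depths = np.asarray(depths, dtype=float)
    # Each sample is the direction scaled by its depth.
    return depths[:, None] * direction[None, :]

depths = np.linspace(1.0, 3.0, 5)  # hypothetical near/far sampling range
points = sample_ray_points(u=40, v=24, fu=64.0, fv=64.0,
                           u0=32.0, v0=32.0, depths=depths)
```

Each row of `points` is one sample location in the camera coordinate frame; the third coordinate equals the sample depth.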
To calculate the probability of occupancy at this sample point, given the camera rotation matrix (parameterized by the camera pose) and the camera translation, we first map the location of the sample point into the coordinate frame of the voxel grid. Then we determine the occupancy of the sample point by trilinear sampling, as shown below:
(1) 
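As a hedged illustration of the trilinear sampling step, the sketch below interpolates an occupancy value at a continuous location that has already been mapped into voxel index coordinates. The function name `trilinear_sample`, the treatment of out-of-grid points as empty, and the requirement of at least two voxels per axis are assumptions of this example.

```python
import numpy as np

def trilinear_sample(voxels, point):
    """Trilinearly interpolate occupancy at a continuous index-space point.

    Points outside the grid return 0 (assumed to be empty space).
    """
    voxels = np.asarray(voxels, dtype=float)
    p = np.asarray(point, dtype=float)
    upper = np.array(voxels.shape) - 1
    if np.any(p < 0) or np.any(p > upper):
        return 0.0
    # Lower corner of the enclosing cell, clamped so the upper corner exists.
    i0 = np.minimum(np.floor(p).astype(int), upper - 1)
    frac = p - i0  # fractional offset inside the cell, in [0, 1]
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((frac[0] if dx else 1.0 - frac[0]) *
                          (frac[1] if dy else 1.0 - frac[1]) *
                          (frac[2] if dz else 1.0 - frac[2]))
                value += weight * voxels[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return value

grid = np.zeros((2, 2, 2))
grid[1, 1, 1] = 1.0
center_value = trilinear_sample(grid, [0.5, 0.5, 0.5])
```

Since the result is a weighted sum of the eight neighboring occupancies, it is differentiable with respect to both the voxel values and the sample location.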
Next, the probability that the ray passing through pixel $(u, v)$ stops at the $i$-th sample point can be obtained as shown below:
(2) 
If the ray does not stop at any sample point, then it penetrates the whole voxel grid. As a result, we can extend (2) to obtain the probability that this ray escapes the voxel grid, and we denote this probability as $q^{p}_{u,v}(\mathrm{escape})$,
(3) 
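The stop and escape probabilities can be illustrated with a small sketch. It assumes the common formulation used in differentiable ray consistency [25], in which a ray stops at sample i if it is occupied there and passed through all earlier samples; the exact form used by the paper is not recoverable from the text, and `ray_event_probabilities` is a hypothetical name.

```python
import numpy as np

def ray_event_probabilities(occupancies):
    """Return (stop probabilities per sample, escape probability) for one ray."""
    x = np.asarray(occupancies, dtype=float)
    # Probability of passing through all samples before i (empty product = 1).
    pass_before = np.concatenate(([1.0], np.cumprod(1.0 - x)[:-1]))
    q_stop = x * pass_before
    # The ray escapes only if every sample along it is unoccupied.
    q_escape = np.prod(1.0 - x)
    return q_stop, q_escape

q_stop, q_escape = ray_event_probabilities([0.1, 0.8, 0.5])
```

Under this formulation the stop probabilities and the escape probability sum to one, so they form a valid distribution over ray events.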
With the above observation, we have the ray consistency loss at pixel (, ) defined below:
\[ L_{ray}(u, v) = m(u, v)\, q^{p}_{u,v}(\mathrm{escape}) + (1 - m(u, v)) \sum_{i=1}^{N} q^{p}_{u,v}(i), \]
where $m(u, v)$ is the value of the mask at location $(u, v)$.
If $m(u, v)$ is 0, the loss encourages the probability that the ray penetrates the voxel grid to be close to 1.
On the other hand, when $m(u, v)$ is 1, the loss encourages the ray to terminate at some sample point, so that the escape probability is near 0.
As a result, we have the differentiable ray consistency loss $L_{ray}$ calculated as the mean of $L_{ray}(u, v)$ over all pixels in the mask.
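The per-pixel ray consistency loss and its mean over the mask can be sketched as below. The array shapes and the function name `ray_consistency_loss` are assumptions of this example; the formula itself follows the per-pixel definition given above.

```python
import numpy as np

def ray_consistency_loss(mask, q_stop, q_escape):
    """mask: (H, W) in {0, 1}; q_stop: (H, W, N); q_escape: (H, W)."""
    mask = np.asarray(mask, dtype=float)
    q_stop = np.asarray(q_stop, dtype=float)
    q_escape = np.asarray(q_escape, dtype=float)
    # Foreground pixels penalize escaping rays; background pixels penalize stops.
    per_pixel = mask * q_escape + (1.0 - mask) * q_stop.sum(axis=-1)
    return per_pixel.mean()

mask = np.array([[1, 0]])
q_escape_map = np.array([[0.09, 0.90]])
q_stop_map = np.array([[[0.10, 0.72, 0.09], [0.05, 0.03, 0.02]]])
loss_ray = ray_consistency_loss(mask, q_stop_map, q_escape_map)
```

In this toy example the foreground pixel contributes its escape probability (0.09) and the background pixel contributes its total stop probability (0.10).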
In addition to ray consistency, we also consider the projection loss for 2D-3D self-consistency. From the ray consistency loss above, $1 - q^{p}_{u,v}(\mathrm{escape})$ represents the pixel $(u, v)$ of the 3D-2D projection. We thus obtain the projection after evaluating every ray passing through the mask, as shown in Fig. 2. That is, if the ray terminates within the voxels, we expect $q^{p}_{u,v}(\mathrm{escape})$ to be close to 0, so that $1 - q^{p}_{u,v}(\mathrm{escape})$ is near 1 and represents an object pixel in the projection, as depicted in Fig. 2. To evaluate the 3D reconstruction, we consider the metric of intersection over union (IoU) and impose the IoU loss [19] between the projection and the mask as shown below:
(4) 
As (3.2)-(4) are differentiable w.r.t. the camera pose and the voxels, this loss guides the training of both camera pose estimation and 3D shape reconstruction.
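A hedged sketch of an IoU-style projection loss is given below. It uses a common soft (differentiable) IoU, with the product as a soft intersection; whether this matches the exact form of (4) cannot be confirmed from the text. The name `soft_iou_loss` and the stabilizer `eps` are assumptions.

```python
import numpy as np

def soft_iou_loss(projection, mask, eps=1e-8):
    """One minus soft IoU between a projected silhouette and a mask."""
    p = np.asarray(projection, dtype=float)
    m = np.asarray(mask, dtype=float)
    intersection = (p * m).sum()
    union = (p + m - p * m).sum()
    return 1.0 - intersection / (union + eps)

mask = np.array([[1.0, 0.0]])
projection = 1.0 - np.array([[0.09, 0.90]])  # one minus escape probability
loss_proj = soft_iou_loss(projection, mask)
```

The loss approaches 0 when the projected silhouette matches the mask and equals 1 when the two do not overlap at all.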
In our work, we consider and fix and in (3.2). Because some fine structures like the bases of chairs would diminish when mask is of size as used in [24], we use mask of size .
Finally, we note that the voxel and mask discussed here can be either the ground truth or the predicted ones. As shown in Fig. 1, we use the ground truth voxel and mask for the 2D-3D self-consistency loss if fully supervised learning is applicable. If semi-supervised learning is of interest, we then consider the predicted voxel and its mask to calculate the 2D-3D self-consistency loss for the unlabeled data.
3.3 Learning of Our Model
Supervised Learning. To train our model in a fully supervised setting, both the ground truth voxel and mask are observed during training, while the ground truth camera pose is not available. The overall loss function for fully supervised learning is shown below:
\[ L_{sup} = \alpha_3 L_{3D} + \alpha_4 L_{2D} + \alpha_5 L_{sc} + \alpha_6 L_{KL}. \]
Here, we calculate the 2D3D consistency loss between the ground truth voxel and mask instead of the predicted ones, which allows the training efficiency and the effectiveness in camera pose disentanglement. We use , and to update image encoder , and and for updating voxel decoder and mask decoder , respectively. In our work, we fix , , , and .
The voxel reconstruction loss consists of positive weighted cross entropy and IoU losses as shown below:
(5) 
where adopts a positive weight to better preserve the fine structures in 3D, and is defined below:
(6) 
Without this technique, the model tends to predict zeros for the voxels corresponding to such structures which minimizes the overall cross entropy loss. As for the IoU loss [19], it is calculated as:
(7) 
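The voxel reconstruction loss can be sketched as a positively weighted cross entropy plus an IoU term. The weight value `pos_weight=2.0`, the unweighted sum of the two terms, and all function names are assumptions for illustration; the values used in the paper are not recoverable from the text.

```python
import numpy as np

def weighted_cross_entropy(pred, target, pos_weight=2.0, eps=1e-7):
    """Binary cross entropy with extra weight on occupied (positive) voxels."""
    p = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    t = np.asarray(target, dtype=float)
    loss = -(pos_weight * t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    return loss.mean()

def soft_iou_loss(pred, target, eps=1e-8):
    """One minus soft IoU between predicted occupancies and ground truth."""
    p = np.asarray(pred, dtype=float)
    t = np.asarray(target, dtype=float)
    return 1.0 - (p * t).sum() / ((p + t - p * t).sum() + eps)

def voxel_reconstruction_loss(pred, target, pos_weight=2.0):
    # Assumed combination: plain sum of the two terms.
    return (weighted_cross_entropy(pred, target, pos_weight)
            + soft_iou_loss(pred, target))

target = np.array([1.0, 0.0, 0.0, 0.0])   # a sparse, fine structure
all_empty = np.full(4, 0.01)              # predicting "empty" everywhere
good = np.array([0.99, 0.01, 0.01, 0.01])
```

Raising `pos_weight` increases the penalty for missing an occupied voxel, which discourages the degenerate all-empty prediction on thin structures.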
For 2D mask segmentation, we calculate the cross entropy between the predicted and ground truth masks as our 2D reconstruction loss $L_{2D}$.
As for the KL divergence loss, it is enforced to regularize the distribution of shape code and model ambiguity of 3D reconstruction due to unseen parts of shapes [6]. We adopt conditional variational autoencoder [18, 15] for and as shown below:
(8) 
where the shape code consists of mean and variance . Only the mean is utilized by the mask decoder as mask segmentation is underdetermined.
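For illustration, the sketch below shows the standard closed-form KL divergence between a diagonal Gaussian and a standard normal prior, together with reparameterized sampling. Using a standard normal prior and a log-variance parameterization are assumptions of this example; the conditional formulation referenced above may differ in detail.

```python
import numpy as np

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, I) ) for a diagonal Gaussian."""
    mean = np.asarray(mean, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return -0.5 * np.sum(1.0 + log_var - mean ** 2 - np.exp(log_var))

def sample_shape_code(mean, log_var, rng):
    """Reparameterized sample: mean + std * noise."""
    mean = np.asarray(mean, dtype=float)
    std = np.exp(0.5 * np.asarray(log_var, dtype=float))
    return mean + std * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
code = sample_shape_code(np.zeros(8), np.zeros(8), rng)
kl_value = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

A decoder that should behave deterministically, such as the mask decoder described above, would consume only the mean and skip the sampling step.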
Semi-supervised Learning. In practice, we cannot collect the ground truth 3D voxels and 2D masks for all images, and thus a semi-supervised setting is of interest. In this setting, only a portion of the input images come with ground truth voxels and masks, while the remaining training data are 2D images only. With such training data, we first pre-train our network using the fully supervised data, followed by fine-tuning the network using unlabeled input images only. To be more specific, for semi-supervised learning, the overall loss function is defined as below:
\[ L_{semi} = L_{sc} + \alpha_6 L_{KL}. \]
Note that $L_{sc}$ and $L_{KL}$ are calculated to update the image encoder, while $L_{sc}$ is used for updating the voxel decoder.
We note that, in the network refinement process under this semi-supervised setting, our introduced 2D-3D self-consistency is critical. It allows our model to align the predicted 3D voxel and the predicted 2D mask with the disentangled camera pose, all in an unsupervised fashion.
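The two training objectives can be assembled as in the following sketch. The individual loss values are passed in as plain numbers, and the default weights of 1.0 are placeholders only, since the weights used in the paper are not recoverable from the text. The function names are hypothetical.

```python
def supervised_loss(l_3d, l_2d, l_sc, l_kl, a3=1.0, a4=1.0, a5=1.0, a6=1.0):
    """Weighted sum used in the fully supervised stage (placeholder weights)."""
    return a3 * l_3d + a4 * l_2d + a5 * l_sc + a6 * l_kl

def semi_supervised_loss(l_sc, l_kl, a6=1.0):
    """Objective for fine-tuning on unlabeled images (placeholder weight)."""
    return l_sc + a6 * l_kl

# Stage 1: pre-train on labeled data; stage 2: fine-tune on unlabeled data.
stage1 = supervised_loss(0.5, 0.25, 0.125, 0.0625)
stage2 = semi_supervised_loss(0.5, 0.25, a6=2.0)
```

In the second stage the self-consistency term would be computed from the predicted voxel and predicted mask, since no ground truth is available for the unlabeled images.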
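Table 1: Quantitative comparison of 3D reconstruction methods with fully supervised learning.

Method | airplane | car | chair | Mean
3D-R2N2 [4] | 51.3 | 79.8 | 46.6 | 59.2
OGN [23] | 58.7 | 81.6 | 48.3 | 62.9
PSGN [6] | 60.1 | 83.1 | 54.4 | 65.9
voxel tube [19] | 67.1 | 82.1 | 55.0 | 68.1
Matryoshka [19] | 64.7 | 85.0 | 54.7 | 68.1
Ours | 69.2 | 85.8 | 56.7 | 70.6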
4 Experiments
4.1 Dataset
We consider the ShapeNet dataset [2], which contains a rich collection of 3D CAD models and is widely used in recent research works related to 2D/3D data. Three categories, airplane, car, and chair, are selected for our experiments. For fair comparisons, we consider two different data settings. For supervised learning of our model and to perform comparisons, we follow the works of 3D-R2N2 [4], Octree Generating Network (OGN) [23], Point Set Generation Network (PSGN) [6], voxel tube network and Matryoshka network [19], which scale the ground truth voxels to fit into grids. This makes ground truth voxels larger than those considered in MVC [24] and DRC [25]. We use the same rendered images, ground truth voxels, and data split as used in these works.
For semisupervised learning, we generate rendered images of size pixels and the corresponding ground truth 2D masks, using the same camera pose information and data split as used in Perspective Transformer Nets (PTN) [29]. We utilize the same ground truth voxels in Multiview Consistency (MVC) [24] and Differentiable Ray Consistency (DRC) [25] to fit our projection module, and the grid size of voxels is .
4.2 Fully Supervised Learning
For our model trained with fully supervised data, we consider the state-of-the-art methods of [4, 23, 6, 19] for comparisons. Since both ground truth voxels and 2D masks are available during training, there is no need to utilize the 2D-3D self-consistency loss for camera pose disentanglement.
Quantitative results of different 3D reconstruction methods are listed in Table 1. From this table, we see that our network achieved favorable performance, showing that our network architecture is preferable under such settings.
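For reference, a voxel IoU score of the kind used to evaluate 3D reconstruction can be computed as in the sketch below. The binarization threshold of 0.5 and the convention of returning 1.0 for two empty grids are assumptions of this example.

```python
import numpy as np

def voxel_iou(pred_occupancy, gt_voxels, threshold=0.5):
    """Intersection over union between thresholded predictions and ground truth."""
    pred = np.asarray(pred_occupancy) > threshold
    gt = np.asarray(gt_voxels).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both grids empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

iou = voxel_iou([0.9, 0.8, 0.2, 0.6], [1, 1, 0, 0])
```

Here two of the three voxels in the union are shared, so the score is two thirds; scores reported in tables are typically such values scaled to percentages.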
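Table 2: Comparison with approaches not requiring ground truth 3D shape information (S: single view, M: multiple views).

Method | Training data | Label percentage | Single/multi views | airplane | car | chair | Mean
DRC [25] | 2D mask & pose | 100% | M | 53.5 | 65.7 | 49.3 | 56.2
MVC [24] | 2D mask | 100% | M | 49.4 | 63.9 | 39.6 | 51.0
Ours (Semi) | 2D mask & voxel | 15% | S | 59.6 | 75.3 | 52.3 | 62.4
Ours (Sup) | 2D mask & voxel | 100% | S | 68.8 | 80.1 | 57.2 | 68.7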
4.3 Semi-Supervised Learning
To demonstrate the effect of our semi-supervised learning, we use different portions of labeled/unlabeled images to train the network for 3D reconstruction. For comparison purposes, we use 100%, 80%, 60%, 40%, and 20% of the labeled images for training the fully supervised version of our model, and the rest of the images are unused. As for semi-supervised learning, we use 80%, 60%, 40%, and 20% of the labeled images, plus the remaining unlabeled ones, for training our model.
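The labeled/unlabeled protocol can be sketched as a simple index split. The random shuffling, the seed, the dataset size, and the function name `split_labeled_unlabeled` are assumptions of this example; the paper does not state how the labeled subset is chosen.

```python
import numpy as np

def split_labeled_unlabeled(num_images, labeled_fraction, seed=0):
    """Return index arrays for the labeled and the unlabeled training subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_images)
    num_labeled = int(round(labeled_fraction * num_images))
    return order[:num_labeled], order[num_labeled:]

# Hypothetical dataset of 1000 images with 20% labeled.
labeled_idx, unlabeled_idx = split_labeled_unlabeled(1000, 0.2)
```

The fully supervised baseline would train on `labeled_idx` only, while the semi-supervised variant would additionally fine-tune on `unlabeled_idx` without using their labels.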
We compare our models trained with fully supervised and semi-supervised learning, and present the quantitative results in Fig. 3. From this figure, we see that our semi-supervised version exhibited very promising capabilities in handling unlabeled data, which resulted in improved performance compared to the versions using only labeled data for training. Qualitative results are also shown in Fig. 4 for visualization purposes.
To perform more complete comparisons with recent approaches not requiring ground truth 3D shape information, we consider the works of MVC [24] and DRC [25]. Note that both DRC and MVC require multi-view images as inputs for training; moreover, DRC requires ground truth camera pose information. Table 2 lists and compares the performance of these methods, in which our method achieved the best results among all. Fig. 5 additionally presents qualitative results, which visually compare the 3D reconstruction abilities of the different models.
4.4 Camera Pose Prediction
To demonstrate the ability of our model in disentangling camera pose information without such supervision, we transform our voxels into pose-aware shapes with our predicted and manipulated camera poses, as depicted in Fig. 6. From the results shown in this figure, we see that our model successfully learns interpretable visual representations from 2D images, including shape and camera pose features.
We also present the quantitative results of camera pose prediction compared to MVC [24] in Table 3. It is worth pointing out that MVC [24] requires multi-view images as inputs. From this table, we see that our model successfully disentangles and predicts the camera pose information without supervision of such information.
4.5 Ablation Studies
To assess the contributions of each component and the design of our proposed network architecture, we perform the following ablation studies.
Network Architecture.
Recall that our proposed voxel decoder consists of three separate 2D deconvolution layers in three directions, with each of the channel dimensions corresponding to height, width, and depth, respectively. We compare our proposed voxel decoder with the one using 2D deconvolution layers in only one direction [19].
For fair comparisons, we fix other components in our network, and compare the performances of the networks utilizing these two different kinds of voxel decoders. From the results shown in Table 4, we see that our decoder design resulted in improved 3D shape reconstruction performances.
2D-3D Self-Consistency. We assess the contribution of our network module in observing 2D-3D self-consistency. For simplicity, we only consider the ray consistency loss with and without the projection loss, and we use the ground truth camera pose to calculate (3.2) and (4). Then, we compare the resulting 3D reconstruction results.
From the comparison results shown in Table 5, we confirm that the introduced 2D-3D self-consistency, which combines both the ray consistency and projection losses, improves the performance of 3D reconstruction. Thus, exploiting this property for 3D shape reconstruction is preferable.
Table 3: Quantitative results of camera pose prediction compared to MVC [24].

Method | airplane | chair
MVC [24] | 35.10° | 9.41°
Ours (Sup) | 11.24° | 9.82°
Ours (Semi) | 9.34° | 6.24°
5 Conclusion
In this paper, we proposed a deep learning framework for single-image 3D reconstruction. Our proposed model is able to learn a deep representation from a single 2D input for recovering its 3D voxels. This is achieved by disentangling unknown camera pose information from the above features by exploiting 2D-3D self-consistency. More importantly, no camera pose information, classification labels, or discriminator is required. Our method can be trained in a fully supervised setting, as most 3D shape reconstruction models are, utilizing 2D images and their ground truth 3D voxels for training. It can also be trained in semi-supervised settings, which use additional unlabeled 2D images to further enhance the reconstruction results. Both quantitative and qualitative results demonstrated that our method was able to produce satisfactory results when compared to state-of-the-art approaches in fully or semi-supervised settings. Thus, the effectiveness and robustness of our model can be successfully verified.
Table 4: Ablation study on the voxel decoder design.

Decoder design | airplane | car | chair | Mean
1 deconv | 68.6 | 83.7 | 56.2 | 69.5
3 deconvs | 69.2 | 85.8 | 56.7 | 70.6
Table 5: Ablation study on the 2D-3D self-consistency loss.

Method | airplane | car | chair | Mean
Ray | 53.5 | 65.7 | 49.3 | 56.2
Ray + Projection | 55.7 | 66.0 | 49.6 | 57.1
References
 [1] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? building 3d morphable models from 2d images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):232–244, Jan. 2013.
 [2] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q.X. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An informationrich 3d model repository. CoRR, abs/1512.03012, 2015.
 [3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems (NIPS), 2016.
 [4] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3dr2n2: A unified approach for single and multiview 3d object reconstruction. In European Conference Computer Vision (ECCV), 2016.
 [5] J. Delanoy, M. Aubry, P. Isola, A. Efros, and A. Bousseau. 3d sketching using multiview deep volumetric prediction. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 1(21), may 2018.
 [6] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2463–2471, 2017.
 [7] M. Gadelha, S. Maji, and R. Wang. 3d shape induction from 2d views of multiple objects. In International Conference on 3D Vision (3DV), pages 402–411, 2017.
 [8] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference Computer Vision (ECCV), 2016.
 [9] I. J. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Neural Information Processing Systems (NIPS), 2014.
 [10] E. Grant, P. Kohli, and M. van Gerven. Deep disentangled representations for volumetric reconstruction. In ECCV Workshops, 2016.
 [11] J. Gwak, C. B. Choy, M. Chandraker, A. Garg, and S. Savarese. Weakly supervised 3d reconstruction with adversarial constraint. In 3D Vision (3DV), International Conference on 3D Vision, 2017.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [13] D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Neural Information Processing Systems (NIPS), pages 4996–5004, 2016.
 [14] A. Kar, C. Häne, and J. Malik. Learning a multiview stereo machine. In Neural Information Processing Systems (NIPS), 2017.
 [15] D. P. Kingma and M. Welling. Autoencoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
 [16] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. In Neural Information Processing Systems (NIPS), 2015.
 [17] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang. 3d shape reconstruction from sketches via multiview convolutional networks. In International Conference on 3D Vision (3DV), 2017.
 [18] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations (ICLR), 2016.
 [19] S. R. Richter and S. Roth. Matryoshka networks: Predicting 3d geometry via nested shape layers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [20] O. Ronneberger, P. Fischer, and T. Brox. Unet: Convolutional networks for biomedical image segmentation. In Medical Image Computing and ComputerAssisted Intervention (MICCAI), pages 234–241. Springer International Publishing, 2015.
 [21] N. Savinov, C. Häne, L. Ladický, and M. Pollefeys. Semantic 3d reconstruction with continuous regularization and ray potentials using a visibility consistency constraint. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5460–5469, June 2016.
 [22] A. A. Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2511–2519, 2017.
 [23] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In IEEE International Conference on Computer Vision (ICCV), pages 2107–2115, 2017.
 [24] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [25] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 209–217, 2017.
 [26] A. O. Ulusoy, A. Geiger, and M. J. Black. Towards probabilistic volumetric reconstruction using ray potentials. In International Conference on 3D Vision (3DV), pages 10–18, 2015.
 [27] H. Wang, J. Yang, W. Liang, and X. Tong. Deep single-view 3d object reconstruction with visual hull embedding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
 [28] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In Neural Information Processing Systems (NIPS), 2017.
 [29] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Neural Information Processing Systems (NIPS), 2016.
 [30] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In IEEE International Conference on Computer Vision (ICCV), pages 57–65, Oct 2017.
Appendix
Appendix A Implementation Details
We implement the proposed network in PyTorch. The image encoder, voxel decoder, and mask decoder are not pretrained; all are randomly initialized. We train all three networks with the Adam optimizer. In the fully supervised setting, the learning rate for the image encoder, voxel decoder, and mask decoder is set to . In the semi-supervised setting, the learning rates for the three networks are all set to . The batch size is 48. The positive weight for is set to 3. We train our model on a single NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. More details about the network architecture are given in Appendix B. We will also release our code so that further details are available.
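To make the role of the positive weight concrete, the following is a minimal NumPy sketch of a positively weighted binary cross-entropy. It assumes the weight of 3 applies to a binary cross-entropy term over occupancy or mask logits, in the same sense as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`; which loss the weight actually belongs to is not stated here, and the function name `weighted_bce_with_logits` is hypothetical.

```python
import numpy as np

def weighted_bce_with_logits(logits, targets, pos_weight=3.0):
    """Mean binary cross-entropy on raw logits, with positives up-weighted.

    Assumption: this mirrors the pos_weight convention of PyTorch's
    BCEWithLogitsLoss; the loss used in the paper may differ.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # softplus(x) = log(1 + exp(x)), evaluated stably with logaddexp
    neg_log_p = np.logaddexp(0.0, -logits)    # -log(sigmoid(x))
    neg_log_1mp = np.logaddexp(0.0, logits)   # -log(1 - sigmoid(x))
    loss = pos_weight * targets * neg_log_p + (1.0 - targets) * neg_log_1mp
    return float(loss.mean())

# A missed occupied voxel costs three times as much as a missed empty one.
loss_pos = weighted_bce_with_logits([0.0], [1.0])
loss_neg = weighted_bce_with_logits([0.0], [0.0])
```

Up-weighting positives is a common way to counter the class imbalance of voxel grids, where most cells are empty.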
Appendix B Network Architecture
We now describe the detailed network architectures, including the image encoder, mask decoder, and voxel decoder.
Image Encoder.
Our image encoder has a residual structure [12]. More specifically, it is composed of five residual blocks.
The activation function for each block output is a leaky rectified linear unit (Leaky ReLU).
Mask Decoder.
Our mask decoder has a U-Net-based structure [20].
It is composed of 5 upsampling blocks.
Voxel Decoder.
The proposed voxel decoder contains three separate 2D deconvolution layers; the channel dimension of each layer's output corresponds to height, width, or depth, respectively.
As illustrated in Fig. 7, two of the three output tensors are first transposed so that their channels are mapped to the corresponding spatial dimensions.
We then take the element-wise mean of the three tensors to obtain the predicted shape.
The complete architecture design of each network component is shown in Fig. 8.
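The fusion step of the voxel decoder can be sketched in NumPy as follows. This is only an illustration under assumed tensor layouts: the function name `fuse_voxel_branches`, the axis order (depth, height, width) of the final volume, and the per-branch layouts in the docstring are hypothetical and not taken from the released architecture.

```python
import numpy as np

def fuse_voxel_branches(out_d, out_h, out_w):
    """Fuse three 2D-decoder outputs into one (D, H, W) volume.

    Assumed layouts, channel axis first:
      out_d: (D, H, W)  channels = depth, already aligned
      out_h: (H, D, W)  channels = height
      out_w: (W, D, H)  channels = width
    """
    vol_d = out_d                              # no transpose needed
    vol_h = np.transpose(out_h, (1, 0, 2))     # (H, D, W) -> (D, H, W)
    vol_w = np.transpose(out_w, (1, 2, 0))     # (W, D, H) -> (D, H, W)
    # Element-wise mean of the three aligned tensors
    return (vol_d + vol_h + vol_w) / 3.0

# Usage: three views of the same small, non-cubic volume fuse back to it.
rng = np.random.default_rng(0)
reference = rng.random((2, 3, 4))              # (D, H, W)
fused = fuse_voxel_branches(
    reference,
    reference.transpose(1, 0, 2),              # laid out as (H, D, W)
    reference.transpose(2, 0, 1),              # laid out as (W, D, H)
)
print(fused.shape)  # (2, 3, 4)
```

Under these layouts only two of the three tensors need a transpose, since the branch whose channels encode depth is already aligned with the output volume.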
Appendix C Visualization and Comparisons
We now provide additional visualizations of 3D shape reconstruction using different weakly supervised or semi-supervised methods, including ours. Note that both DRC [25] and MVC [24] require multi-view images as inputs for training. In addition, DRC requires ground truth camera pose information when learning its model. Example results are shown in Fig. 9.