3D-PSRNet: Part Segmented 3D Point Cloud Reconstruction From a Single Image
Abstract
We propose a mechanism to reconstruct part annotated 3D point clouds of objects given just a single input image. We demonstrate that jointly training for both reconstruction and segmentation leads to improved performance on both tasks, when compared to training for each task individually. The key idea is to propagate information from each task so as to aid the other during the training procedure. Towards this end, we introduce a location-aware segmentation loss in the training regime. We empirically show the effectiveness of the proposed loss in generating more faithful part reconstructions while also improving segmentation accuracy. We thoroughly evaluate the proposed approach on different object categories from the ShapeNet dataset to obtain improved results in reconstruction as well as segmentation. Code is available at https://github.com/valiisc/3dpsrnet.
Keywords: Point cloud, 3D reconstruction, 3D part segmentation
1 Introduction
Human object perception is based on semantic reasoning [8]. When viewing the objects around us, we can not only mentally estimate their 3D shape from limited information, but we can also reason about object semantics. For instance, upon viewing the image of an airplane in Figure 1, we might deduce that it contains four distinct parts: body, wings, tail, and turbine. Recognition of these parts further enhances our understanding of individual part geometries as well as the overall 3D structure of the airplane. This ability to perceive objects driven by semantics is important for our interaction with the world around us and the manipulation of objects within it.
In machine vision, the ability to infer 3D structure from single-view images has far-reaching applications in the field of robotics and perception. Semantic understanding of the perceived 3D object is particularly advantageous in tasks such as robot grasping, object manipulation, etc.
Deep neural networks have been successfully employed for tackling the problem of 3D reconstruction. Most of the existing literature proposes techniques for predicting voxelized representations. However, this representation has a number of drawbacks. First, it suffers from sparsity of information: all the information needed to perceive the 3D structure is provided by the surface voxels, while the voxels within the volume enlarge the representation space with minimal addition of information. Second, the neural network architectures required for processing and predicting 3D voxel maps make use of 3D CNNs, which are computationally heavy and lead to considerable overhead during training and inference. For these reasons, there have been concerted efforts to explore representations with reduced computational complexity compared to voxel formats. Very recently, there have been works focusing on designing neural network architectures and loss formulations to process and predict 3D point clouds [13, 14, 3, 16, 9]. Since point clouds consist of points sampled uniformly on the object's surface, they encode maximal information about the object's 3D characteristics. The information-rich encoding and compute-friendly architectures make the point cloud an ideal candidate for 3D shape generation and reconstruction tasks. Hence, we consider the point cloud as our representation format.
In this work, we seek to answer three important questions in the tasks of semantic object reconstruction and segmentation:
(1) What is an effective way of inferring an accurate, semantically annotated 3D point cloud representation of an object when provided with its two-dimensional image counterpart?
(2) How do we incorporate object geometry into the segmentation framework so as to improve segmentation accuracy?
(3) How do we incorporate semantic understanding into the reconstruction framework so as to improve the reconstruction of individual parts?
We address these questions by training a neural network to jointly optimize for the reconstruction and segmentation losses. We empirically show that such joint training achieves superior performance on both reconstruction and segmentation when compared to two different neural networks trained on each task independently. To enable the flow of information between the two tasks, we propose a novel loss formulation that integrates the knowledge from both the predicted semantics and the reconstructed geometry.
In summary, our contributions in this work are as follows:
- We propose 3D-PSRNet, a part segmented 3D reconstruction network that is jointly optimized for the tasks of reconstruction and segmentation.
- To enable the flow of information from one task to the other, we introduce a novel loss function called the location-aware segmentation loss. We empirically show that the proposed loss aids in generating more faithful part reconstructions, while also resulting in more accurate segmentations.
- We evaluate 3D-PSRNet on a synthetic dataset and achieve state-of-the-art performance in the task of semantic 3D object reconstruction from a single image.
2 Related Work
3D Reconstruction
In recent times, deep learning based approaches have achieved significant progress in the field of 3D reconstruction. Earlier works focused on voxel-based representations [4, 19, 2]. Girdhar et al. [4] map 3D models and their corresponding 2D representations to a common embedding space to obtain a representation that is both predictable from 2D images and capable of generating 3D objects. Wu et al. [19] utilize variational autoencoders with an additional adversarial criterion to obtain improved reconstructions. Choy et al. [2] employ a 3D recurrent network to obtain reconstructions from multiple input images. While the above works directly utilize ground truth 3D models in the training stage, [20, 17, 18, 22] reconstruct the 3D object using 2D observations from multiple viewpoints.
Several recent works have made use of point clouds in place of voxels to represent 3D objects [3, 5, 11]. Fan et al. [3] showed that point cloud prediction is not only computationally efficient but also outperforms voxelbased reconstruction approaches. Groueix et al. [5] represented a 3D shape as a collection of parametric surface elements and constructed a mesh from the predicted point cloud. Mandikal et al. [11] trained an image encoder in the latent space of a point cloud autoencoder, while also enforcing a constraint to obtain diverse reconstructions. However, all of the above works focus solely on the point cloud reconstruction task.
3D Semantic Segmentation
Semantic segmentation using neural networks has been extensively studied in the 2D domain [10, 6]. The corresponding task in 3D has recently been explored by works such as [15, 13, 14, 7, 12]. Song et al. [15] take a depth map of a scene as input and predict a voxelized occupancy grid containing semantic labels on a per-voxel basis. They optimize for the multi-class segmentation loss and argue that scene completion aids semantic label prediction and vice versa. Our representation format is a 3D point cloud, while [15] outputs voxels. This gives rise to a number of differences in the training procedure. Voxel-based methods predict an occupancy grid and hence optimize the cross-entropy loss for both reconstruction and segmentation. Point cloud based works, on the other hand, optimize distance-based metrics for reconstruction and cross-entropy for segmentation. We introduce a location-aware segmentation loss tailored for point cloud representations.
[13, 14] introduce networks that take in point cloud data to perform classification and segmentation, with network architectures and loss formulations that are able to handle the inherent unordered nature of point cloud data. While [3] predicts only the 3D point cloud geometry from 2D images, and [13, 14] segment input point clouds, our approach stresses the importance of jointly optimizing for reconstruction and segmentation while transitioning from 2D to 3D.
3 Approach
In this section, we introduce our model, 3D-PSRNet, which generates a part segmented 3D point cloud from a 2D RGB image. As a baseline for comparison, we train two separate networks for the tasks of reconstruction and segmentation (Figure 2(a)). Given an RGB image $I$ as input, the reconstruction network ($baseline_{rec}$) outputs a 3D point cloud $\widehat{X}_p \in \mathbb{R}^{N \times 3}$, where $N$ is the number of points in the point cloud. Given a 3D point cloud $X_p$ as input, the segmentation network ($baseline_{seg}$) predicts the class labels $\widehat{X}_c \in \{1, \ldots, c\}^N$, where $c$ is the number of part classes present in the object category. During inference, the image $I$ is passed through $baseline_{rec}$ to obtain $\widehat{X}_p$, which is then passed through $baseline_{seg}$ to obtain $\widehat{X}_c$.
Our training pipeline instead jointly predicts $(\widehat{X}_p, \widehat{X}_c)$ (Figure 2(b)). The reconstruction network is modified such that $c$ additional predictions, representing the class probabilities of each point, are made at the final layer. The network is simultaneously trained with reconstruction and segmentation losses, as explained below.
3.1 Loss Formulation
Reconstruction Loss: We require a loss formulation that is invariant to the order of points in the point cloud. To satisfy this criterion, we choose the Chamfer distance between the ground truth point cloud $X_p$ and the predicted point cloud $\widehat{X}_p$ as the reconstruction loss:

$$L_{rec} = \sum_{\widehat{x} \in \widehat{X}_p} \min_{x \in X_p} \lVert \widehat{x} - x \rVert_2^2 + \sum_{x \in X_p} \min_{\widehat{x} \in \widehat{X}_p} \lVert \widehat{x} - x \rVert_2^2 \qquad (1)$$
Segmentation Loss: We use the point-wise softmax cross-entropy loss (denoted by $L_{ce}$) between the ground truth class labels $X_c$ and the predicted class labels $\widehat{X}_c$. For the training of $baseline_{seg}$, since there is a direct point-to-point correspondence between the input points and their labels, we directly apply the segmentation loss as the cross-entropy loss between $X_c$ and $\widehat{X}_c$:

$$L_{seg} = L_{ce}(\widehat{X}_c, X_c) \qquad (2)$$
However, during joint training, there exists no such point-to-point correspondence between the ground truth and predicted class labels. We therefore introduce the location-aware segmentation loss to propagate semantic information between matching point pairs (Figure 2(c)). Let $l(p)$ denote the class label (ground truth label or predicted class probabilities) associated with a point $p$. The loss consists of two terms:

Forward segmentation loss ($L_{seg}^{fwd}$): For every predicted point $\widehat{x} \in \widehat{X}_p$, we find the closest ground truth point $x^* \in X_p$ and apply $L_{ce}$ on their corresponding class labels:

$$L_{seg}^{fwd} = \sum_{\widehat{x} \in \widehat{X}_p} L_{ce}\big(l(\widehat{x}), l(x^*)\big), \quad x^* = \arg\min_{x \in X_p} \lVert \widehat{x} - x \rVert_2^2 \qquad (3)$$

Backward segmentation loss ($L_{seg}^{bwd}$): For every ground truth point $x \in X_p$, we find the closest predicted point $\widehat{x}^* \in \widehat{X}_p$ and apply $L_{ce}$ on their corresponding class labels:

$$L_{seg}^{bwd} = \sum_{x \in X_p} L_{ce}\big(l(\widehat{x}^*), l(x)\big), \quad \widehat{x}^* = \arg\min_{\widehat{x} \in \widehat{X}_p} \lVert \widehat{x} - x \rVert_2^2 \qquad (4)$$

The overall segmentation loss is then the sum of the forward and backward segmentation losses:

$$L_{seg} = L_{seg}^{fwd} + L_{seg}^{bwd} \qquad (5)$$
The total loss during joint training is then given by

$$L = \lambda_{rec} L_{rec} + \lambda_{seg} L_{seg} \qquad (6)$$

where $\lambda_{rec}$ and $\lambda_{seg}$ weight the reconstruction and segmentation terms.
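To make the loss formulation concrete, the Chamfer distance and the location-aware segmentation loss can be sketched in NumPy as below. This is a minimal sketch under assumed conventions (integer part labels, mean-reduced cross-entropy over points, brute-force pairwise distances), not the released implementation; the function and variable names are our own.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance (Eq. 1) between two point sets.
    pred: (N, 3) predicted points; gt: (M, 3) ground truth points."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)
    # Squared distance of each point to its nearest neighbour in the
    # other set, summed over both directions.
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def location_aware_seg_loss(pred_pts, pred_logits, gt_pts, gt_labels):
    """Location-aware segmentation loss (Eqs. 3-5): cross-entropy between
    the labels of nearest-neighbour point pairs, in both directions.
    pred_logits: (N, c) class scores for the predicted points;
    gt_labels:   (M,)   integer part labels for the ground truth points."""
    d = np.sum((pred_pts[:, None, :] - gt_pts[None, :, :]) ** 2, axis=-1)
    # Per-point softmax over the c part classes.
    e = np.exp(pred_logits - pred_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    # Forward (Eq. 3): match every predicted point to its closest GT point.
    fwd = d.argmin(axis=1)
    l_fwd = -np.log(probs[np.arange(len(pred_pts)), gt_labels[fwd]]).mean()
    # Backward (Eq. 4): match every GT point to its closest predicted point.
    bwd = d.argmin(axis=0)
    l_bwd = -np.log(probs[bwd, gt_labels]).mean()
    # Eq. 5: sum of the forward and backward terms.
    return l_fwd + l_bwd
```

When the predicted points coincide with the ground truth and the class scores are confident and correct, both terms vanish; in a training framework the O(NM) distance matrix would typically be replaced by a batched pairwise-distance op on the GPU.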
3.2 Implementation Details
For training the baseline segmentation network $baseline_{seg}$, we follow the architecture of the segmentation network of PointNet [13], which consists of ten 1D convolutional layers, the last of which has $c$ filters, where $c$ is the number of class labels. A global max-pool function is applied after the fifth layer and the resulting feature is concatenated with each individual point feature, as is done in the original paper. Batch normalization is applied at all layers of the network. The baseline reconstruction network and the joint 3D-PSRNet are similar in architecture: four 2D convolutional layers followed by four fully connected layers, with final output dimensions of $3N$ (reconstruction) and $(3 + c)N$ (joint), where $N$ is the number of points in the point cloud. We set $N$ to 1024 in all our experiments, and use a mini-batch size of 32. We train the individual reconstruction and segmentation networks for 1000 epochs, while the joint network (3D-PSRNet) is trained for 500 epochs. We choose the best model according to the corresponding minimum loss. In Eq. 6, $\lambda_{rec}$ is set to one, while $\lambda_{seg}$ is chosen based on the ablation in Section 4.4. Code is available at https://github.com/valiisc/3dpsrnet.
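The joint network's final layer makes 3 + c predictions per point, which must be separated into coordinates and class scores. The helper below is a sketch under an assumed output layout (xyz coordinates followed by class scores for each point); the released code may order its outputs differently, and the function name is our own.

```python
import numpy as np

def split_joint_output(y, n_points, n_classes):
    """Split the flat output of the joint network's last fully connected
    layer into a point cloud and per-point class logits.
    y: vector of length n_points * (3 + n_classes)."""
    y = np.asarray(y).reshape(n_points, 3 + n_classes)
    xyz = y[:, :3]      # (N, 3) reconstructed point coordinates
    logits = y[:, 3:]   # (N, c) part-class scores for each point
    return xyz, logits
```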
4 Experiments
4.1 Dataset
We train all our networks on synthetic models from the ShapeNet dataset [1], whose part annotated ground truth point clouds are provided by [21]. Our dataset comprises 7346 models from three exemplar categories: chair, car, and airplane. We render each model from ten different viewing angles with varying azimuth and elevation values, so as to obtain a dataset of size 73,460. We use the train/validation/test split provided by [21] and train a single model on all categories in all our experiments.
4.2 Evaluation Methodology

Reconstruction: We report both the Chamfer distance (Eq. 1) and the Earth Mover's Distance (EMD), computed on 1024 points, in all our evaluations. The EMD between two point sets $X_1$ and $X_2$ is given by:

$$d_{EMD}(X_1, X_2) = \min_{\phi: X_1 \to X_2} \sum_{x \in X_1} \lVert x - \phi(x) \rVert_2 \qquad (7)$$

where $\phi: X_1 \to X_2$ is a bijection. For computing the metrics, we renormalize both the ground truth and predicted point clouds within a bounding box of length 1 unit.
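Since Eq. 7 minimizes over bijections, the EMD between equal-sized point sets reduces to a linear assignment problem. The sketch below solves it exactly with SciPy's Hungarian-algorithm solver; this is our illustrative choice, not necessarily the (often approximate) matching used in practice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def earth_movers_distance(x1, x2):
    """EMD (Eq. 7) between equal-sized point sets: the minimum total
    point-to-point Euclidean distance over all bijections x1 -> x2."""
    assert len(x1) == len(x2), "this formulation requires equal-sized sets"
    # Cost matrix of pairwise Euclidean distances, shape (N, N).
    cost = np.linalg.norm(x1[:, None, :] - x2[None, :, :], axis=-1)
    # Optimal one-to-one matching (Hungarian algorithm).
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()
```

The exact solver is O(N^3), so at N = 1024 approximate matchings are commonly substituted during training; for offline evaluation the exact version is affordable.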

Segmentation: We formulate part segmentation as a per-point classification problem, and use mIoU on points as the evaluation metric. For each shape $S$ of category $C$, we calculate the shape mIoU as follows: for each part type in category $C$, we compute the IoU between the ground truth and the prediction; if the union of ground truth and prediction points is empty, the part IoU is counted as 1. We then average the IoUs over all part types in category $C$ to obtain the mIoU for that shape. The category mIoU is the average of the shape mIoUs over all shapes in that category. Since there is no correspondence between the ground truth and predicted points, we use a mechanism similar to the one described in Section 3.1 to compute forward and backward mIoUs, which are then averaged to obtain the final mIoU:

$$mIoU = \frac{1}{2}\left(\frac{1}{c}\sum_{i=1}^{c}\frac{n^{fwd}_{ii}}{\sum_{j} n^{fwd}_{ij} + \sum_{j} n^{fwd}_{ji} - n^{fwd}_{ii}} + \frac{1}{c}\sum_{i=1}^{c}\frac{n^{bwd}_{ii}}{\sum_{j} n^{bwd}_{ij} + \sum_{j} n^{bwd}_{ji} - n^{bwd}_{ii}}\right) \qquad (8)$$

where $n^{fwd}_{ij}$ is the number of points belonging to category $i$ in $X_c$ that are predicted as category $j$ in $\widehat{X}_c$ under the forward point correspondences between $\widehat{X}_p$ and $X_p$, $n^{bwd}_{ij}$ is defined analogously for the backward correspondences, and $c$ is the total number of categories.
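The evaluation procedure above can be sketched as follows: labels are compared through nearest-neighbour matches in both directions, and the forward and backward mIoUs are averaged. This is a simplified sketch with our own function names, assuming integer part labels and the "empty union counts as 1" convention stated above.

```python
import numpy as np

def matched_miou(pred_pts, pred_labels, gt_pts, gt_labels, n_classes):
    """Part mIoU between point clouds with no point-to-point
    correspondence, via forward/backward nearest-neighbour matching."""
    d = np.sum((pred_pts[:, None, :] - gt_pts[None, :, :]) ** 2, axis=-1)

    def miou(a, b):
        # Mean IoU over part classes for two aligned label arrays; a
        # class absent from both arrays counts as IoU = 1.
        ious = []
        for k in range(n_classes):
            inter = np.sum((a == k) & (b == k))
            union = np.sum((a == k) | (b == k))
            ious.append(1.0 if union == 0 else inter / union)
        return float(np.mean(ious))

    fwd = miou(pred_labels, gt_labels[d.argmin(axis=1)])  # per predicted point
    bwd = miou(pred_labels[d.argmin(axis=0)], gt_labels)  # per GT point
    return 0.5 * (fwd + bwd)
```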
Table 1: Reconstruction (Chamfer, EMD; lower is better) and segmentation (mIoU; higher is better) metrics on ShapeNet for the baseline and the jointly trained 3D-PSRNet.

Category | Metric  | Baseline | 3D-PSRNet
Chair    | Chamfer | 6.82     | 6.57
Chair    | EMD     | 11.37    | 10.10
Chair    | mIoU    | 78.09    | 81.92
Car      | Chamfer | 5.48     | 5.14
Car      | EMD     | 5.88     | 5.53
Car      | mIoU    | 59.0     | 61.57
Airplane | Chamfer | 4.06     | 4.06
Airplane | EMD     | 7.06     | 6.24
Airplane | mIoU    | 62.86    | 68.64
Mean     | Chamfer | 5.45     | 5.26
Mean     | EMD     | 8.10     | 7.29
Mean     | mIoU    | 66.65    | 70.71
4.3 Results
Table 1 presents the quantitative results on ShapeNet for the baseline and joint training approaches. 3D-PSRNet achieves considerable improvement in both the reconstruction (Chamfer, EMD) and segmentation (mIoU) metrics, matching or outperforming the baseline approach in every metric on all categories. On average, we obtain a 4.1% improvement in mIoU.
Qualitative results are presented in Figures 3 and 4. 3D-PSRNet obtains more faithful reconstructions than the baseline, achieving better correspondence with the input image, and also predicts more uniformly distributed point clouds. We observe that joint training reduces the hallucination of parts (e.g. predicting handles for chairs without handles) and spurious segmentations. We also show a few failure cases of our approach in Figure 5. The network misses some finer structures present in the object (e.g. dual turbines in the case of airplanes), and the reconstructions are poorer for uncommon input samples. However, these drawbacks also exist in the baseline approach.
4.4 Relative Importance of Reconstruction and Segmentation Losses
We present an ablation study on the relative weighting of the reconstruction and segmentation losses in Eq. 6. We fix the value of $\lambda_{rec}$ to one, while $\lambda_{seg}$ is varied over a wide range. Figure 6 plots the Chamfer, EMD, and mIoU metrics for varying values of $\lambda_{seg}$. We observe that for very low values of $\lambda_{seg}$, both the reconstruction and segmentation metrics are worse off, while values beyond a point have minimal effect on the average metrics. Based on Figure 6, we choose the value of $\lambda_{seg}$ used in all our experiments.
5 Conclusion
In this paper, we highlighted the importance of jointly learning the tasks of 3D reconstruction and object part segmentation. We introduced a loss formulation in the training regime that propagates information between the two tasks, generating more faithful part reconstructions while also improving segmentation accuracy. We thoroughly evaluated against existing reconstruction and segmentation baselines to demonstrate the superiority of the proposed approach. Quantitative and qualitative evaluations on the ShapeNet dataset demonstrate the effectiveness of our method in generating more accurate point clouds with detailed part information compared to current state-of-the-art reconstruction and segmentation networks.
References
 [1] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
 [2] Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In: European Conference on Computer Vision. pp. 628–644. Springer (2016)
 [3] Fan, H., Su, H., Guibas, L.: A point set generation network for 3D object reconstruction from a single image. In: Conference on Computer Vision and Pattern Recognition (CVPR). vol. 38 (2017)
 [4] Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A.: Learning a predictable and generative vector representation for objects. In: European Conference on Computer Vision. pp. 484–499. Springer (2016)
 [5] Groueix, T., Fisher, M., Kim, V.G., Russell, B., Aubry, M.: AtlasNet: A PapierMâché Approach to Learning 3D Surface Generation. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2018)
 [6] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Computer Vision (ICCV), 2017 IEEE International Conference on. pp. 2980–2988. IEEE (2017)
 [7] Kalogerakis, E., Averkiou, M., Maji, S., Chaudhuri, S.: 3D shape segmentation with projective convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
 [8] Koopman, S.E., Mahon, B.Z., Cantlon, J.F.: Evolutionary constraints on human object perception. Cognitive science 41(8), 2126–2148 (2017)
 [9] Li, Y., Bu, R., Sun, M., Chen, B.: PointCNN. arXiv preprint arXiv:1801.07791 (2018)
 [10] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
 [11] Mandikal, P., K L, N., Agarwal, M., Babu, R.V.: 3D-LMNet: Latent embedding matching for accurate and diverse 3D point cloud reconstruction from a single image. In: Proceedings of the British Machine Vision Conference (BMVC) (2018)
 [12] Muralikrishnan, S., Kim, V.G., Chaudhuri, S.: Tags2parts: Discovering semantic regions from shape tags. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2926–2935 (2018)
 [13] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1(2), 4 (2017)
 [14] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems. pp. 5105–5114 (2017)
 [15] Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. pp. 190–198. IEEE (2017)
 [16] Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, V., Yang, M.H., Kautz, J.: Splatnet: Sparse lattice networks for point cloud processing. arXiv preprint arXiv:1802.08275 (2018)
 [17] Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multiview supervision for singleview reconstruction via differentiable ray consistency. In: CVPR. vol. 1, p. 3 (2017)
 [18] Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, B., Tenenbaum, J.: MarrNet: 3D shape reconstruction via 2.5D sketches. In: Advances In Neural Information Processing Systems. pp. 540–550 (2017)
 [19] Wu, J., Zhang, C., Xue, T., Freeman, B., Tenenbaum, J.: Learning a probabilistic latent space of object shapes via 3D generativeadversarial modeling. In: Advances in Neural Information Processing Systems. pp. 82–90 (2016)
 [20] Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: Learning singleview 3D object reconstruction without 3D supervision. In: Advances in Neural Information Processing Systems. pp. 1696–1704 (2016)
 [21] Yi, L., Kim, V.G., Ceylan, D., Shen, I.C., Yan, M., Su, H., Lu, C., Huang, Q., Sheffer, A., Guibas, L.: A scalable active framework for region annotation in 3d shape collections. SIGGRAPH Asia (2016)
 [22] Zhu, R., Galoogahi, H.K., Wang, C., Lucey, S.: Rethinking reprojection: Closing the loop for poseaware shape reconstruction from a single image. In: Computer Vision (ICCV), 2017 IEEE International Conference on. pp. 57–65. IEEE (2017)