PointGrow: Autoregressively Learned Point Cloud Generation with Self-Attention
Abstract
A point cloud is an agile 3D representation that efficiently models an object's surface geometry. However, these surface-centric properties also pose challenges for designing tools to recognize and synthesize point clouds. This work presents a novel autoregressive model, PointGrow, which generates realistic point cloud samples from scratch or conditioned on given semantic contexts. Our model operates recurrently, with each point sampled according to a conditional distribution given its previously generated points. Since point cloud object shapes are typically encoded by long-range inter-point dependencies, we augment our model with dedicated self-attention modules to capture these relations. Extensive evaluation demonstrates that PointGrow achieves satisfying performance on both unconditional and conditional point cloud generation tasks, with respect to fidelity, diversity and semantic preservation. Further, conditional PointGrow learns a smooth manifold of given image conditions on which 3D shape interpolation and arithmetic calculation can be performed. Code and models are available at: https://github.com/syb7573330/PointGrow.
Yongbin Sun, Yue Wang, Ziwei Liu†, Joshua E. Siegel & Sanjay E. Sarma († corresponding author)
1. Massachusetts Institute of Technology 2. The Chinese University of Hong Kong 
1 Introduction
3D visual understanding (Bogo et al. (2016); Zuffi et al. (2018)) is at the core of next-generation vision systems. Specifically, point clouds, agile 3D representations, have emerged as indispensable sensory data in applications including indoor navigation (Díaz Vilariño et al. (2016)), immersive technology (Stets et al. (2017); Sun et al. (2018a)) and autonomous driving (Yue et al. (2018)). There is growing interest in integrating deep learning into point cloud processing (Klokov & Lempitsky (2017); Lin et al. (2017a); Achlioptas et al. (2017b); Kurenkov et al. (2018); Yu et al. (2018); Xie et al. (2018)). With the expressive power brought by modern deep models, unprecedented accuracy has been achieved on high-level point cloud related tasks including classification, detection and segmentation (Qi et al. (2017a; b); Wang et al. (2018); Li et al. (2018); You et al. (2018)). Yet, existing point cloud research focuses primarily on developing effective discriminative models (Xie et al. (2018); Shen et al. (2018)), rather than generative models.
This paper investigates the synthesis and processing of point clouds, presenting a novel generative model called PointGrow. We propose an autoregressive architecture (Oord et al. (2016); Van Den Oord et al. (2016)) to accommodate the surface-centric nature of point clouds, generating every single point recurrently. Within each step, PointGrow estimates a conditional distribution of the point under consideration given all its preceding points, as illustrated in Figure 1. This approach easily handles the irregularity of point clouds, and encodes diverse local structures relative to point-distance-based methods (Fan et al. (2017); Achlioptas et al. (2017b)).
However, to generate realistic point cloud samples, we also need long-range part configurations to be plausible. We therefore introduce two self-attention modules (Lin et al. (2017b); Velickovic et al. (2017); Zhang et al. (2018)) in the context of point clouds to capture these long-range relations. Each dedicated self-attention module learns to dynamically aggregate long-range information during the point generation process. In addition, our conditional PointGrow learns a smooth manifold of given images where interpolation and arithmetic calculation can be performed on image embeddings.
Compared to prior art, PointGrow has appealing properties:

Unlike traditional 3D generative models that rely on local regularities on grids (Wu et al. (2016); Choy et al. (2016); Yang et al. (2017); Sun et al. (2018b)), PointGrow builds upon an autoregressive architecture that is inherently suitable for modeling point clouds, which are irregular and surface-centric.

Our proposed self-attention module successfully captures the long-range dependencies between points, helping to generate plausible part configurations within 3D objects.

PointGrow, as a generative model, enables effective unsupervised feature learning, which is useful for recognition tasks, especially in the low-data regime.
Extensive evaluations demonstrate that PointGrow can generate realistic and diverse point cloud samples with high resolution, on both unconditional and conditional point cloud generation tasks.
2 PointGrow
In this section, we introduce the formulation and implementation of PointGrow, a new generative model for point clouds, which generates 3D shapes in a point-by-point manner.
Unconditional PointGrow. A point cloud, $\mathbf{S}$, that consists of $n$ points is defined as $\mathbf{S} = \{s_1, s_2, \dots, s_n\}$, and the $i^{th}$ point is expressed as $s_i = \{z_i, y_i, x_i\}$ in 3D space. Our goal is to assign a probability $p(\mathbf{S})$ to each point cloud. We do so by factorizing the joint probability of $\mathbf{S}$ as a product of conditional probabilities over all its points:
$$p(\mathbf{S}) = \prod_{i=1}^{n} p(s_i \mid s_1, \dots, s_{i-1}) = \prod_{i=1}^{n} p(s_i \mid s_{\leq i-1}) \tag{1}$$
The value $p(s_i \mid s_{\leq i-1})$ is the probability of the $i^{th}$ point $s_i$ given all its previous points, and can be computed as the joint probability over its coordinates:
$$p(s_i \mid s_{\leq i-1}) = p(z_i \mid s_{\leq i-1}) \cdot p(y_i \mid s_{\leq i-1}, z_i) \cdot p(x_i \mid s_{\leq i-1}, z_i, y_i) \tag{2}$$
where each coordinate is conditioned on all the previously generated coordinates. To facilitate the point cloud generation process, we sort points in the order of $z$, $y$ and $x$, which forces a shape to be generated in a "plane-sweep" manner along its primary axis ($z$ axis). Following Oord et al. (2016) and Van Den Oord et al. (2016), we model the conditional probability distribution of each coordinate using a deep neural network. Prior art shows that a softmax discrete distribution works better than mixture models, even though the data are implicitly continuous. To obtain discrete point coordinates, we scale all point clouds to fall within the range [0, 1], and quantize their coordinates to a fixed number of uniformly distributed values. The number of values is chosen as a trade-off between generative performance and minimizing quantization artifacts. Other advantages of adopting discrete coordinates include (1) simplified implementation, (2) improved flexibility to approximate any arbitrary distribution, and (3) no distribution mass is generated outside of the range, which is common for continuous cases.
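The preprocessing described above (scaling, quantization and plane-sweep ordering) can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `quantize_and_sort`, the use of a single global minimum and maximum for scaling, and the toy value `num_bins=8` are all assumptions made for the example.

```python
import numpy as np

def quantize_and_sort(points_zyx, num_bins):
    """Scale a point cloud into [0, 1], quantize it, and sort it by (z, y, x).

    points_zyx: float array of shape (n, 3) with columns ordered (z, y, x).
    num_bins:   number of discrete values per coordinate (a free parameter here).
    """
    pts = np.asarray(points_zyx, dtype=np.float64)
    # Assumption: one global min/max, so the aspect ratio of the shape is kept.
    lo, hi = pts.min(), pts.max()
    unit = (pts - lo) / (hi - lo)
    # Map [0, 1] to integer bins 0..num_bins-1; the value 1.0 goes to the last bin.
    bins = np.minimum((unit * num_bins).astype(np.int64), num_bins - 1)
    # np.lexsort treats its LAST key as the primary one, so z is listed last.
    order = np.lexsort((bins[:, 2], bins[:, 1], bins[:, 0]))
    return bins[order]

rng = np.random.default_rng(0)
cloud = rng.normal(size=(16, 3))          # toy input, not real data
tokens = quantize_and_sort(cloud, num_bins=8)
```

Each row of `tokens` is one point expressed as three class indices, which is the form a softmax output over discrete coordinate values would be trained against.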
Context Awareness Operation. Context awareness improves model inference. For example, in Qi et al. (2017a) and Wang et al. (2018), a global feature is obtained by applying max pooling along each feature dimension, and then used to provide context information for solving semantic segmentation tasks. Similarly, we obtain "semi-global" features for all sets of available points in the point cloud generation process, as illustrated in Figure 2 (left). Each row of the resultant features aggregates the context information of all the previously generated points dynamically by fetching and averaging. This Context Awareness (CA) operation is implemented as a plug-in module in our model, and mean pooling is used in our experiments.
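A minimal sketch of the CA operation is a prefix pooling over the rows of a feature matrix. The function name `context_awareness` is hypothetical, and the sketch assumes that row i pools over rows 0 to i inclusive; the max variant is included only because a max-pooling baseline is compared later in the paper.

```python
import numpy as np

def context_awareness(features, pool="mean"):
    """Row i of the result summarizes rows 0..i of `features` (shape (n, d))."""
    if pool == "mean":
        counts = np.arange(1, features.shape[0] + 1)[:, None]
        return np.cumsum(features, axis=0) / counts
    if pool == "max":
        return np.maximum.accumulate(features, axis=0)
    raise ValueError(f"unknown pool: {pool}")

F = np.array([[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]])
ca_mean = context_awareness(F, "mean")
ca_max = context_awareness(F, "max")
```

Because both variants are running reductions, all "semi-global" features are obtained in a single pass over the points.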
Self-Attention Context Awareness Operation. The CA operation summarizes point features in a fixed way via pooling. For the same purpose, we propose two alternative learning-based operations to determine the weights for aggregating point features. We define them as Self-Attention Context Awareness (SACA) operations, and the weights as self-attention weights.
The first SACA operation, SACA-A, is shown in the middle of Figure 2. To generate self-attention weights, SACA-A first associates local and "semi-global" information by concatenating input and "semi-global" features after the CA operation, and then passes them to a Multi-Layer Perceptron (MLP). Formally, given a point feature matrix, $\mathbf{F}$, with its $i^{th}$ row, $f_i$, representing the feature vector of the $i^{th}$ point for $1 \leq i \leq n$, we compute the $i^{th}$ self-attention weight vector, $w_i$, as below:
$$w_i = \mathrm{MLP}\big(\mathrm{Mean}\{f_j\}_{j=1}^{i} \oplus f_i\big) \tag{3}$$
where $\mathrm{Mean}\{\cdot\}$ is mean pooling, $\oplus$ is concatenation, and $\mathrm{MLP}(\cdot)$ is a sequence of fully connected layers. The self-attention weight encodes information about the context change due to each newly generated point, and is unique to that point. We then conduct element-wise multiplication between input point features and self-attention weights to obtain weighted features, which are accumulated sequentially to generate corresponding context features. The process to calculate the $i^{th}$ context feature, $c_i$, can be expressed as:
$$c_i = \sum_{j=1}^{i} w_j \otimes f_j \tag{4}$$
where $\otimes$ is element-wise multiplication. Finally, we shift context features downward by one row, because when estimating the coordinate distribution for point $s_i$, only its previous points, $s_{\leq i-1}$, are available. A zero vector of the same size is attached to the beginning as the initial context feature, since no previous point exists when computing features for the first point.
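The SACA-A steps of Eq. (3) and (4), followed by the one-row shift, might be sketched as below. The two-layer ReLU MLP, its layer sizes and the random parameters are illustrative assumptions; the names `mlp` and `saca_a` are hypothetical.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU in between (sizes are illustrative)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def saca_a(features, params):
    """features: (n, d) point features. Returns (n, d) shifted context features."""
    n, d = features.shape
    # CA operation: running mean over the first i points ("semi-global" feature).
    semi_global = np.cumsum(features, axis=0) / np.arange(1, n + 1)[:, None]
    # Eq. (3): one weight vector per point from [semi-global ; local] features.
    weights = mlp(np.concatenate([semi_global, features], axis=1), *params)
    # Eq. (4): accumulate the weighted features sequentially.
    context = np.cumsum(weights * features, axis=0)
    # Shift down by one row and prepend zeros: point i only sees points before it.
    return np.vstack([np.zeros((1, d)), context[:-1]])

rng = np.random.default_rng(0)
n, d, hidden = 6, 4, 8
params = (rng.normal(size=(2 * d, hidden)), np.zeros(hidden),
          rng.normal(size=(hidden, d)), np.ones(d))
F = rng.normal(size=(n, d))
C = saca_a(F, params)
```

The shift is what keeps the operation causal: editing the feature of one point can only change the context rows of later points.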
Figure 2 (right) shows the other SACA operation, SACA-B. SACA-B is similar to SACA-A, except in the way self-attention weights are computed and applied. In SACA-B, the "semi-global" feature after the CA operation is shared by the first $i$ point features to obtain self-attention weights, which are then used to compute $c_i$. This process can be described mathematically as:
$$c_i = \sum_{j=1}^{i} \mathrm{MLP}\big(\mathrm{Mean}\{f_k\}_{k=1}^{i} \oplus f_j\big) \otimes f_j \tag{5}$$
Compared to SACA-A, SACA-B self-attention weights encode the importance of each point feature under a common context, as can be seen by comparing Eq. (4) and (5).
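A corresponding sketch of SACA-B is given below, under the reading that for each i a single "semi-global" feature is concatenated with every one of the first i point features before the MLP. The explicit loop is chosen for clarity rather than speed, the final shift is assumed to match SACA-A, and all names and layer sizes are hypothetical.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU in between (sizes are illustrative)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def saca_b(features, params):
    """features: (n, d) point features. Returns (n, d) shifted context features."""
    n, d = features.shape
    semi_global = np.cumsum(features, axis=0) / np.arange(1, n + 1)[:, None]
    context = np.zeros((n, d))
    for i in range(n):
        prefix = features[: i + 1]                        # first i+1 point features
        shared = np.broadcast_to(semi_global[i], prefix.shape)  # common context
        weights = mlp(np.concatenate([shared, prefix], axis=1), *params)
        context[i] = (weights * prefix).sum(axis=0)
    # Assumed to follow SACA-A: shift by one row and start from a zero context.
    return np.vstack([np.zeros((1, d)), context[:-1]])

rng = np.random.default_rng(1)
n, d, hidden = 5, 3, 6
params = (rng.normal(size=(2 * d, hidden)), np.zeros(hidden),
          rng.normal(size=(hidden, d)), np.ones(d))
F = rng.normal(size=(n, d))
C = saca_b(F, params)
```

Unlike SACA-A, the weights for earlier points are recomputed whenever the shared context changes, so this direct form costs quadratic time in the number of points.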
Learning happens only in MLP for both operations. In Figure 3, we plot the attention maps, which visualize Euclidean distances between the context feature of a selected point and the point features of its accessible points before SACA operation.
Model Architecture. Figure 4 shows the proposed network model to output conditional coordinate distributions. The top, middle and bottom branches model $p(z_i \mid s_{\leq i-1})$, $p(y_i \mid s_{\leq i-1}, z_i)$ and $p(x_i \mid s_{\leq i-1}, z_i, y_i)$, respectively, for $i = 1, \dots, n$. The point coordinates are sampled according to the estimated softmax probability distributions. Note that the input points in the latter two cases are masked accordingly so that the network cannot see information that has not been generated. During the training phase, points are available to compute all the context features, thus coordinate distributions can be estimated in parallel. However, the point cloud generation is a sequential procedure, since each sampled coordinate needs to be fed as input back into the network, as demonstrated in Figure 1.
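The sequential generation procedure can be illustrated with a short sampling loop. The trained network is replaced here by `toy_logits_fn`, a hypothetical stand-in that returns arbitrary logits; only the control flow (sample one coordinate, feed it back, repeat) is meant to reflect the text.

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sample_point_cloud(logits_fn, n_points, rng):
    """logits_fn(prev_points, partial_point) -> logits over discrete values."""
    points = np.zeros((n_points, 3), dtype=np.int64)   # columns: z, y, x
    for i in range(n_points):
        for axis in range(3):
            # Only already sampled values are visible: points[:i] and the
            # coordinates of the current point that precede `axis`.
            logits = logits_fn(points[:i], points[i, :axis])
            probs = softmax(logits)
            points[i, axis] = rng.choice(probs.size, p=probs)
    return points

def toy_logits_fn(prev_points, partial_point, num_bins=8):
    """Hypothetical stand-in for the network: favours bins near the last z."""
    center = prev_points[-1, 0] if len(prev_points) else 0
    return -0.5 * (np.arange(num_bins) - center) ** 2

samples = sample_point_cloud(toy_logits_fn, n_points=5, rng=np.random.default_rng(0))
```

Generation therefore needs three network evaluations per point, whereas training can score all coordinates of a known shape in one masked forward pass.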
Conditional PointGrow. Given a condition or embedding vector, h, we hope to generate a shape satisfying the latent meaning of h. To achieve this, Eq. (1) and (2) are adapted to Eq. (6) and (7), respectively, as below:
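$$p(\mathbf{S} \mid h) = \prod_{i=1}^{n} p(s_i \mid s_{\leq i-1}, h) \tag{6}$$
$$p(s_i \mid s_{\leq i-1}, h) = p(z_i \mid s_{\leq i-1}, h) \cdot p(y_i \mid s_{\leq i-1}, z_i, h) \cdot p(x_i \mid s_{\leq i-1}, z_i, y_i, h) \tag{7}$$
The additional condition, $h$, affects the coordinate distributions by adding biases in the generative process. We implement this by changing the operation between adjacent fully-connected layers from $f^{i+1} = g(\mathbf{W} f^{i})$ to $f^{i+1} = g(\mathbf{W} f^{i} + \mathbf{H} h)$, where $f^{i}$ and $f^{i+1}$ are feature vectors in the $i^{th}$ and $(i+1)^{th}$ layer, respectively, $\mathbf{W}$ is a weight matrix, $\mathbf{H}$ is a matrix that transforms $h$ into a vector with the same dimension as $\mathbf{W} f^{i}$, and $g(\cdot)$ is a nonlinear activation function. In this paper, we experimented with $h$ as a one-hot categorical vector, which adds a class-dependent bias, and as a high-dimensional embedding vector of a 2D image, which adds a geometric constraint.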
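The conditioning mechanism amounts to adding a condition-dependent bias inside each fully connected step. A minimal sketch follows, assuming a ReLU activation and toy dimensions; the name `conditional_layer` is hypothetical.

```python
import numpy as np

def conditional_layer(f, W, H, h):
    """One fully connected step with a condition-dependent bias: g(W f + H h)."""
    return np.maximum(W @ f + H @ h, 0.0)   # g is assumed to be ReLU here

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))     # maps a 4-d feature to a 5-d feature
H = rng.normal(size=(5, 3))     # maps a 3-d condition into the same 5-d space
f = rng.normal(size=4)
h_onehot = np.array([0.0, 1.0, 0.0])
out = conditional_layer(f, W, H, h_onehot)
```

With a one-hot condition, the product of H and h simply selects one column of H, so every class gets its own learned bias vector.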
3 Experiments
Datasets. We evaluated the proposed framework on the ShapeNet dataset (Chang et al. (2015)), which is a collection of CAD models. We used a subset consisting of 17,687 models across 7 categories: car, airplane, table, chair, bench, cabinet and lamp. To generate corresponding point clouds, we first sampled 10,000 points uniformly from each mesh, and then used farthest point sampling to select 1,024 of them to represent the shape. Each category follows a split ratio of 0.9/0.1 to separate training and testing sets. ModelNet40 (Wu et al. (2015)) and PASCAL3D+ (Xiang et al. (2014)) are also used for further analysis. ModelNet40 contains CAD models from 40 categories, and we obtain their point clouds from Qi et al. (2017a). PASCAL3D+ is composed of PASCAL 2012 detection images augmented with 3D CAD model alignment, and is used to demonstrate the generalization ability of conditional PointGrow.
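Farthest point sampling, used above to reduce each dense sample to a fixed number of points, can be sketched as a greedy loop. The function name and the choice of index 0 as the starting point are assumptions of this example.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so that each new point is far from those chosen."""
    points = np.asarray(points, dtype=np.float64)
    chosen = [start]
    # Squared distance from every point to its nearest chosen point so far.
    min_dist = np.sum((points - points[start]) ** 2, axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist = np.sum((points - points[nxt]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
    return np.array(chosen)

# Eleven collinear points at x = 0, 1, ..., 10.
line = np.stack([np.arange(11.0), np.zeros(11), np.zeros(11)], axis=1)
picked = farthest_point_sampling(line, k=3)   # indices 0, 10, 5
```

On the toy line, the second pick is the opposite end and the third is the midpoint, which shows how the procedure spreads samples evenly over a shape.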
3.1 Unconditional point cloud generation
Figure 5 shows point clouds generated by unconditional PointGrow. Since an unconditional model lacks knowledge about the shape category to generate, we train a separate model for each category. Figure 1 demonstrates point cloud generation for the airplane category. Note that no semantic information of discrete coordinates is provided during training, but the predicted distribution turns out to be categorically representative (e.g., in the second row, the network model outputs a roughly symmetric distribution along one coordinate axis, which describes the shape of an airplane's wings). The autoregressive architecture in PointGrow is capable of abstracting high-level semantics even from unaligned point cloud samples.
Evaluation on Fidelity and Diversity. The negative log-likelihood is commonly used to evaluate autoregressive models for image and audio generation (Oord et al. (2016); Van Den Oord et al. (2016)). However, we observed inconsistency between its value and the visual quality of point cloud generation. This is validated by the comparison of two baseline models, CA-Mean and CA-Max, where the SACA operation is replaced with the CA operation implemented by mean and max pooling, respectively. In Figure 6 (left), we report negative log-likelihoods in bits per coordinate on ShapeNet testing sets of airplane and car categories, and visualize their representative results. Although CA-Max shows lower negative log-likelihood values, it gives less visually plausible results (i.e., airplanes lose wings and cars lose rear ends).
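For reference, negative log-likelihood in bits per coordinate is the average of the negative base-2 log-probability assigned to each true discrete coordinate. A small sketch with a hypothetical helper name and toy inputs:

```python
import numpy as np

def bits_per_coordinate(probs, targets):
    """probs: (m, K) predicted distributions; targets: (m,) true bin indices."""
    p_true = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log2(p_true)))

# A uniform prediction over 8 bins costs log2(8) = 3 bits per coordinate.
uniform = np.full((6, 8), 1.0 / 8.0)
nll = bits_per_coordinate(uniform, np.array([0, 3, 7, 1, 2, 5]))
```

The uniform case gives a useful upper reference: a model that has learned anything about the data should score below the base-2 logarithm of the number of quantization values.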
To faithfully evaluate the generation quality, we conduct a user study w.r.t. two aspects, fidelity and diversity, among CA-Max, CA-Mean, Latent-GAN (Achlioptas et al. (2017a)) and PointGrow (implemented with SACA-A). We randomly select 10 generated airplane and car point clouds from each method. To calculate the fidelity score, we ask each user to assign one of three discrete scores to each shape, and take the average. The diversity score is obtained by asking each user to rate the diversity of the shapes generated by each method on a fixed-interval scale. 8 subjects without computer vision background participated in this test. We observe that (1) CA-Mean is more favored than CA-Max, and (2) our PointGrow receives the highest preference on both fidelity and diversity.
Evaluation on Semantics Preserving. After generating point clouds, we perform classification as a measure of semantics preserving. More specifically, after training on ShapeNet training sets, we generated 300 point clouds per category (2,100 in total for 7 categories), and conducted two classification tasks: one is training on original ShapeNet training sets, and testing on generated shapes; the other is training on generated shapes, and testing on original ShapeNet testing sets. PointNet (Qi et al. (2017a)), a widely used model, was chosen as the point cloud classifier. We implement two GAN-based competing methods, 3D-GAN (Wu et al. (2016)) and Latent-GAN (Achlioptas et al. (2017a)), to sample different shapes for each category, and also include CA-Max and CA-Mean for comparison. The results are reported in Table 1. Note that the CA-Mean baseline achieves comparable performance against both GAN-based competing methods. In the first classification task, our SACA-A model outperforms existing models by a relatively large margin, while in the second task, SACA-A and SACA-B models show similar performance.
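Table 1: Classification accuracy (%) using PointNet. SG: trained on ShapeNet training sets and tested on generated shapes; GS: trained on generated shapes and tested on ShapeNet testing sets.

| Methods | SG | GS |
|---|---|---|
| 3D-GAN | 82.7 | 83.4 |
| Latent-GAN-CD | 81.6 | 82.7 |
| Baseline (CA-Max) | 71.9 | 83.4 |
| Baseline (CA-Mean) | 82.1 | 84.4 |
| Ours (SACA-A) | 90.3 | 91.8 |
| Ours (SACA-B) | 89.4 | 91.9 |

Table 2: Classification accuracy (%) on ModelNet40 for unsupervised feature learning methods.

| Methods | Accuracy |
|---|---|
| SPH (Kazhdan et al. (2003)) | 68.2 |
| LFD (Chen et al. (2003)) | 75.2 |
| T-L Network (Girdhar et al. (2016)) | 74.4 |
| VConv-DAE (Sharma et al. (2016)) | 75.5 |
| 3D-GAN (Wu et al. (2016)) | 83.3 |
| Latent-GAN-EMD (Achlioptas et al. (2017a)) | 84.0 |
| Latent-GAN-CD (Achlioptas et al. (2017a)) | 84.5 |
| Ours (SACA-A) | 85.8 |
| Ours (SACA-B) | 84.4 |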
Unsupervised Feature Learning. We next evaluate the learned point feature representations of the proposed framework, using them as features for classification. We obtain the feature representation of a shape by applying different types of "symmetric" functions as illustrated in Qi et al. (2017a) (i.e., min, max and mean pooling) on features of each layer before the SACA operation, and concatenate them all. Following Wu et al. (2016), we first pretrain our model on 7 categories from the ShapeNet dataset, and then use the model to extract feature vectors for both training and testing shapes of the ModelNet40 dataset. A linear SVM is used for feature classification. We report our best results in Table 2. The SACA-A model achieves the best performance, and the SACA-B model performs slightly worse than Latent-GAN-CD. Note that Achlioptas et al. (2017a) use 57,000 models from 55 categories of the ShapeNet dataset for pretraining, while we use 17,687 models across 7 categories.
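The shape descriptor described above can be sketched as pooling each layer's per-point features with min, max and mean, then concatenating the results. The helper name and the toy layer widths are assumptions; the linear SVM stage is omitted.

```python
import numpy as np

def shape_descriptor(layer_features):
    """layer_features: list of (n, d_l) arrays, one per layer, for one shape."""
    pooled = []
    for feats in layer_features:
        # Symmetric functions: the result does not depend on point order.
        pooled += [feats.min(axis=0), feats.max(axis=0), feats.mean(axis=0)]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
layers = [rng.normal(size=(10, 4)), rng.normal(size=(10, 6))]   # toy features
desc = shape_descriptor(layers)            # length 3 * (4 + 6) = 30
```

The descriptor length depends only on the layer widths, not on the number of points, so shapes of different resolution map to vectors of the same size.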
Shape Completion. Our model can also perform shape completion. Given an initial set of points, our model is capable of completing shapes in multiple ways. Figure 7 visualizes example predictions. The input points are sampled from ShapeNet testing sets, which have not been seen during the training process. The shapes generated by our model are different from the original ground truth point clouds, but look plausible. A current limitation of our model is that it works only when the input point set is given as the beginning part of a shape along its primary axis, since our model is designed and trained to generate point clouds along that direction. More investigation is required to complete shapes when partial point clouds are given from other directions.
3.2 Conditional Point Cloud Generation
SACA-A is used to demonstrate conditional PointGrow here owing to its high performance.
Conditioned on Category Label. We first experiment with class-conditional modelling of point clouds, given a one-hot vector h with its nonzero element indicating the shape category. The one-hot condition provides categorical knowledge to guide the shape generation process. We train the proposed model across multiple categories, and plausible point clouds for desired shape categories can be sampled (as shown in Figure 8). Failure cases are also observed: generated shapes present interwoven geometric properties from other shape types. For example, the airplane misses wings and generates a car-like body; the lamp and the car develop chair leg structures.
Conditioned on 2D Image. Next, we experiment with image conditions for point cloud generation. Image conditions apply constraints to the point cloud generation process because the geometric structures of sampled shapes should satisfy their 2D projections. In our experiments, we obtain an image condition vector by adopting the image encoder in Sun et al. (2018b) to generate a feature vector of 512 elements, and optimize it along with the rest of the model components. The model is trained on the synthetic ShapeNet dataset, and one out of 24 views of a shape (provided by Choy et al. (2016)) is selected at each training step as the image condition input. The trained model is also tested on real images from the PASCAL3D+ dataset to demonstrate its generalizability. For each input, we removed the background, and cropped the image so that the target object is centered. The PASCAL3D+ dataset is challenging because the images are captured in real environments, and contain noisier visual signals which are not seen during the training process. We show ShapeNet testing images and PASCAL3D+ real images together with their sampled point cloud results in Figure 9 (upper left).
We further quantitatively evaluate the conditional generation results by calculating mean Intersection-over-Union (mIoU) with ground truth volumes. Here we only consider 3D volumes containing more than 500 occupied voxels, and use farthest point sampling to select 500 of them to describe the shape. To compensate for the sampling randomness of PointGrow, we slightly align generated points to their nearest ground truth voxels within a neighborhood of 2-voxel radius. As shown in Table 3, PointGrow achieves above-par performance on conditional 3D shape generation.
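Table 3: Per-category IoU for conditional 3D shape generation from a 2D image.

| Methods | airplane | bench | cabinet | car | chair | monitor | lamp | speaker | firearm | couch | table | cellphone | watercraft | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D-R2N2 (1 view) | 0.513 | 0.421 | 0.716 | 0.798 | 0.466 | 0.468 | 0.381 | 0.662 | 0.544 | 0.628 | 0.513 | 0.661 | 0.513 | 0.560 |
| 3D-R2N2 (5 views) | 0.561 | 0.527 | 0.772 | 0.836 | 0.550 | 0.565 | 0.421 | 0.717 | 0.600 | 0.706 | 0.580 | 0.754 | 0.610 | 0.631 |
| PointOutNet | 0.601 | 0.550 | 0.771 | 0.831 | 0.544 | 0.552 | 0.462 | 0.737 | 0.604 | 0.708 | 0.606 | 0.749 | 0.611 | 0.640 |
| Ours | 0.742 | 0.629 | 0.675 | 0.839 | 0.537 | 0.567 | 0.560 | 0.569 | 0.674 | 0.676 | 0.590 | 0.729 | 0.737 | 0.656 |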
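The IoU measurement used for this evaluation can be sketched as follows. The example voxelizes points that are assumed to lie in the unit cube and compares occupancy grids; the grid resolution, the helper names and the toy inputs are assumptions, and the nearest-voxel alignment step described in the text is deliberately left out.

```python
import numpy as np

def voxelize(points, resolution):
    """points in [0, 1]^3 -> boolean occupancy grid of shape (R, R, R)."""
    scaled = (np.asarray(points) * resolution).astype(np.int64)
    idx = np.minimum(scaled, resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

def iou(pred_grid, gt_grid):
    """Intersection-over-Union of two occupancy grids."""
    union = np.logical_or(pred_grid, gt_grid).sum()
    inter = np.logical_and(pred_grid, gt_grid).sum()
    return float(inter) / float(union) if union else 1.0

gt = voxelize(np.array([[0.1, 0.1, 0.1], [0.6, 0.6, 0.6]]), resolution=4)
pred = voxelize(np.array([[0.1, 0.1, 0.1], [0.9, 0.9, 0.9]]), resolution=4)
score = iou(pred, gt)   # one shared voxel out of three occupied overall
```

Averaging this score over all test shapes of a category gives one entry of an mIoU table.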
3.3 Learned Image Condition Manifold
Image Condition Interpolation. Linearly interpolated embedding vectors of image pairs can be conditioned upon to generate linearly interpolated geometrical shapes. As shown in Figure 9 (bottom), walking over the embedded image condition space gives smooth transitions between different geometrical properties of different types of cars.
Image Condition Arithmetic. Another interesting way to impose the image conditions is to perform arithmetic on them. We demonstrate this by combining embedded image condition vectors with different weights. Examples of this kind are shown in Figure 9 (upper right). Note that the generated final shapes contain geometrical features shown in the generated shapes of their operands.
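Both operations act on the embedding vectors only, before any point is generated. A minimal sketch with toy 3-d embeddings follows; the function names are hypothetical, and the resulting vectors would then be passed to the conditional model as h.

```python
import numpy as np

def interpolate(h_a, h_b, steps):
    """Linear walk from embedding h_a to embedding h_b in `steps` points."""
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - ts) * h_a + ts * h_b

def combine(embeddings, weights):
    """Weighted sum of several embeddings (condition arithmetic)."""
    return np.tensordot(np.asarray(weights), np.stack(embeddings), axes=1)

h_a = np.array([0.0, 2.0, 4.0])
h_b = np.array([4.0, 2.0, 0.0])
path = interpolate(h_a, h_b, steps=5)      # shape (5, 3)
mix = combine([h_a, h_b], [0.5, 0.5])      # equal-weight blend
```

Each row of `path` is a valid condition vector, so sampling one shape per row yields a sequence of gradually changing shapes.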
4 Conclusion
In this work, we propose PointGrow, a new generative model that can synthesize realistic and diverse point clouds with high resolution. Unlike previous works that rely on local regularities to synthesize 3D shapes, our PointGrow builds upon an autoregressive architecture to encode the diverse surface information of point clouds. To further capture the long-range dependencies between points, two dedicated self-attention modules are designed and carefully integrated into our framework. PointGrow as a generative model also enables effective unsupervised feature learning, which is extremely useful for low-data recognition tasks. Finally, we show that PointGrow learns a smooth image condition manifold on which 3D shape interpolation and arithmetic calculation can be performed.
References
 Achlioptas et al. (2017a) Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Representation learning and adversarial generation of 3d point clouds. arXiv preprint arXiv:1707.02392, 2017a.
 Achlioptas et al. (2017b) Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392, 2017b.
 Bogo et al. (2016) Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In ECCV, 2016.
 Chang et al. (2015) Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
 Chen et al. (2003) Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3d model retrieval. In Computer graphics forum, 2003.
 Choy et al. (2016) Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
 Díaz Vilariño et al. (2016) Lucía Díaz Vilariño, Pawel Boguslawski, Kourosh Khoshelham, Henrique Lorenzo, and Lamine Mahdjoubi. Indoor navigation from point clouds: 3d modelling and obstacle detection. International Society for Photogrammetry and Remote Sensing, 2016.
 Fan et al. (2017) Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, 2017.
 Girdhar et al. (2016) Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
 Kazhdan et al. (2003) Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In Symposium on geometry processing, 2003.
 Klokov & Lempitsky (2017) Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In ICCV, 2017.
 Kurenkov et al. (2018) Andrey Kurenkov, Jingwei Ji, Animesh Garg, Viraj Mehta, JunYoung Gwak, Christopher Choy, and Silvio Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. In WACV, 2018.
 Li et al. (2018) Yangyan Li, Rui Bu, Mingchao Sun, and Baoquan Chen. Pointcnn. arXiv preprint arXiv:1801.07791, 2018.
 Lin et al. (2017a) Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. arXiv preprint arXiv:1706.07036, 2017a.
 Lin et al. (2017b) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017b.
 Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 Qi et al. (2017a) Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. CVPR, 2017a.
 Qi et al. (2017b) Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017b.
 Sharma et al. (2016) Abhishek Sharma, Oliver Grau, and Mario Fritz. Vconv-dae: Deep volumetric shape learning without object labels. In ECCV, 2016.
 Shen et al. (2018) Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In CVPR, 2018.
 Stets et al. (2017) Jonathan Dyssel Stets, Yongbin Sun, Wiley Corning, and Scott W Greenwald. Visualization and labeling of point clouds in virtual reality. In SIGGRAPH Asia 2017 Posters, 2017.
 Sun et al. (2018a) Yongbin Sun, Sai Nithin R Kantareddy, Rahul Bhattacharyya, and Sanjay E Sarma. X-vision: An augmented vision tool with real-time sensing ability in tagged environments. arXiv preprint arXiv:1806.00567, 2018a.
 Sun et al. (2018b) Yongbin Sun, Ziwei Liu, Yue Wang, and Sanjay E Sarma. Im2avatar: Colorful 3d reconstruction from a single image. arXiv preprint arXiv:1804.06375, 2018b.
 Van Den Oord et al. (2016) Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In SSW, 2016.
 Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 Wang et al. (2018) Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
 Wu et al. (2016) Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NIPS, 2016.
 Wu et al. (2015) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
 Xiang et al. (2014) Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In WACV, 2014.
 Xie et al. (2018) Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In CVPR, 2018.
 Yang et al. (2017) Bo Yang, Hongkai Wen, Sen Wang, Ronald Clark, Andrew Markham, and Niki Trigoni. 3d object reconstruction from a single depth view with adversarial learning. arXiv preprint arXiv:1708.07969, 2017.
 You et al. (2018) Haoxuan You, Yifan Feng, Rongrong Ji, and Yue Gao. Pvnet: A joint convolutional network of point cloud and multi-view for 3d shape recognition. arXiv preprint arXiv:1808.07659, 2018.
 Yu et al. (2018) Lequan Yu, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. Pu-net: Point cloud upsampling network. In CVPR, 2018.
 Yue et al. (2018) Xiangyu Yue, Bichen Wu, Sanjit A Seshia, Kurt Keutzer, and Alberto L Sangiovanni-Vincentelli. A lidar point cloud generator: from a virtual world to autonomous driving. In ICMR, 2018.
 Zhang et al. (2018) Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
 Zuffi et al. (2018) Silvia Zuffi, Angjoo Kanazawa, and Michael J Black. Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images. In CVPR, 2018.