CompoNet: Learning to Generate the Unseen by Part Synthesis and Composition
Abstract
Data-driven generative modeling has made remarkable progress by leveraging the power of deep neural networks. A recurring challenge is how to enable a model to generate a rich variety of samples from the entire target distribution, rather than only from a distribution confined to the training data. In other words, we would like the generative model to go beyond the observed samples and learn to generate “unseen”, yet still plausible, data. In our work, we present CompoNet, a generative neural network for 2D or 3D shapes that is based on a part-based prior, where the key idea is for the network to synthesize shapes by varying both the shape parts and their compositions. Treating a shape not as an unstructured whole, but as a (re)composable set of deformable parts, adds a combinatorial dimension to the generative process to enrich the diversity of the output, encouraging the generator to venture more into the “unseen”. We show that our part-based model generates a richer variety of plausible shapes compared with baseline generative models. To this end, we introduce two quantitative metrics to evaluate the diversity of a generative model and assess how well the generated data covers both the training data and unseen data from the same target distribution. Code is available at https://github.com/nschor/CompoNet.
“For object recognition, the visual system decomposes shapes into parts; parts with their descriptions and spatial relations provide a first index into a memory of shapes.”
— Hoffman & Richards [18]
1 Introduction
Learning generative models of shapes and images has been a long-standing research problem in visual computing. Despite the remarkable progress made, an inherent and recurring limitation remains: a generative model is often only as good as the given training data, as it is always bounded by the empirical distribution of the observed data. More often than not, what can be observed is not sufficiently expressive of the true target distribution. Hence, the generative power of a learned model should be judged not only by the plausibility of the generated data as confined by the training set, but also by its diversity, in particular, by the model’s ability to generate plausible data that is sufficiently far from the training set. Since the target distribution which encompasses both the observed and unseen data is unknown, the main challenge is how to effectively train a network to learn to generate the “unseen”, without making any compromising assumption about the target distribution. For the same reason, even evaluating the generative power of such a network is a non-trivial task.
We believe that the key to generative diversity is to enable more drastic changes, i.e., non-local and/or structural transformations, to the training data. At the same time, such changes must remain within the confines of the target data distribution. In our work, we focus on generative modeling of 2D or 3D shapes, where the typical modeling constraint is to produce shapes belonging to the same category as the exemplars, e.g., chairs or vases. We develop a generative deep neural network based on a part-based prior. That is, we assume that shapes in the target distribution are composed of parts, e.g., chair backs or airplane wings. The network, coined CompoNet, is designed to synthesize novel parts independently and compose them to form a complete shape.
It is well-known that object recognition is intricately tied to reasoning about parts and part relations [18, 43]. Hence, building a generative model based on varying parts and their compositions, while respecting category-specific part priors, is a natural choice and also facilitates grounding of the generated data in the target object category. More importantly, treating a shape as a (re)composable set of parts, instead of a whole entity, adds a combinatorial dimension to the generative model and improves its diversity. By synthesizing parts independently and then composing them, our network enables both part variation and novel combinations of parts, which induces non-local and more drastic shape transformations. Rather than sampling only a single distribution to generate a whole shape, our generative model samples both the geometric distributions of individual parts and the combinatorial varieties arising from part compositions, which encourages the generative process to venture more into the “unseen”, as shown in Figure 1.
While the part-based approach is generic and not strictly confined to a specific generative network architecture, we develop a generative autoencoder (AE) to demonstrate its potential. Our generative AE consists of two stages. In the first, we learn a distinct part-level generative model for each semantic part. In the second stage, we concatenate the learned latent representations with a random vector to generate a new latent representation for the entire shape. These latent representations are fed into a conditional parts-composition network, which is based on a spatial transformer network (STN) [22].
We are not the first to develop deep neural networks for part-based modeling. Some networks learn to compose images [30, 3] or 3D shapes [23, 6, 53] by combining existing parts sampled from a training set or provided as input to the networks. In contrast, our network is fully generative, as it learns both novel part synthesis and composition. Wang et al. [44] train a generative adversarial network (GAN) to produce semantically segmented 3D shapes and then refine the part geometries using an autoencoder network. Li et al. [28] train a VAE-GAN to generate structural hierarchies formed by bounding boxes of object parts and then fill in the part geometries using a separate neural network. Both works take a coarse-to-fine approach and generate a rough 3D shape holistically from a noise vector. In contrast, our network is trained to perform both part synthesis and part composition (with noise augmentation); see Figure 2. Our method also allows the generation of more diverse parts, since we place fewer constraints per part, while holistic models are constrained to generate all parts at once.
We show that the part-based CompoNet produces plausible outputs that better cover unobserved regions of the target distribution, compared to baseline approaches, e.g., [1]. This is validated over random splits of a set of shapes belonging to the same category into a “seen” subset, used for training, and an “unseen” subset. In addition, to evaluate the generative power of our network relative to baseline approaches, we introduce two quantitative metrics to assess how well the generated data covers both the training data and the unseen data from the same target distribution.
2 Background and Related Work
Generative neural networks.
In recent years, generative modeling has gained much attention within the deep learning framework. Two of the most commonly used deep generative models are variational autoencoders (VAE) [25] and generative adversarial networks (GAN) [15]. Both methods have made remarkable progress in image and shape generation problems [47, 21, 37, 54, 45, 49, 44].
Many works are devoted to improving the basic models and their training. In [16, 31, 4], new cost functions are suggested to achieve smooth and non-vanishing gradients. Sohn et al. [41] and Odena et al. [35] proposed conditional generative models based on VAE and GAN, respectively. Hoang et al. [17] train multiple generators to explore different modes of the data distribution. Similarly, MIX+GAN [2] uses a mixture of generators to improve the diversity of the generated distribution, while a combination of multiple discriminators and a single generator aims at constructing a stronger discriminator to guide the generator. GMAN [10] explores an array of discriminators to boost generator learning. Some methods [20, 29, 51] use a global discriminator together with multiple local discriminators.
Following PointNet [36], generative models that work directly on point clouds have been proposed. Achlioptas et al. [1] proposed an AE+GMM generative model for point clouds, which is considered state-of-the-art.
Our work is orthogonal to these methods. We address the case where the generator is unable to generate other valid samples since they are not well represented in the training data. We show that our partbased priors can assist the generation process and extend the generator’s capabilities.
Learning-based shape synthesis.
Li et al. [28] present a top-down, structure-oriented approach for 3D shape generation. They learn symmetry hierarchies [46] of shapes with an autoencoder and generate variations of these hierarchies using a VAE-GAN. The nodes of the hierarchies are independently instantiated with parts. However, these parts are not necessarily connected, and their aggregation does not form a coherent connected shape. In our work, the shapes are generated coherently as a whole, and special care is given to inter-part relations and connectivity.
Most relevant to our work is the shape variational autoencoder of Nash and Williams [34], where a point-cloud-based autoencoder is developed to learn a low-dimensional latent space, from which novel shapes can be generated by sampling vectors. Like our method, the generated shapes are segmented into semantic parts. In contrast, however, they require a one-to-one dense correspondence among the training shapes, since they represent the shapes as ordered vectors. Their autoencoder learns the overall (global) 3D shapes with no attention to local details. Our approach pays particular attention to both the generated parts and their composition.
Inverse procedural modeling
aims to learn a generative procedure from a given set of exemplars. Some recent works, e.g., [38, 52, 39], have focused on developing neural models, such as autoencoders, to generate shape synthesis procedures or programs. However, current inverse procedural modeling methods are not designed to generate unseen data that lies far from the exemplars.
Assembly-based synthesis.
The early and seminal work of Funkhouser et al. [14] composes new shapes by retrieving relevant shapes from a repository, extracting shape parts, and gluing them together. Many follow-up works [5, 42, 7, 23, 48, 24, 12, 19] improve the modeling process with more sophisticated techniques that consider the part relations or shape structures, e.g., employing Bayesian networks or modular templates. We refer to recent surveys [33, 32] for an overview of these and related works.
In the image domain, recent works [30, 3] develop neural networks to assemble images or scenes from existing components. These works utilize an STN [22] to compose the components into a coherent image/scene. In our work, an STN is integrated as an example of prior information about the data generation process. In contrast to previous works, we first synthesize parts using multiple generative AEs and then employ an STN to compose the parts.
Recent concurrent efforts [9, 27] also propose deep neural networks for shape modeling using a part-based prior, but on voxelized representations. Dubrovina et al. [9] encode shapes into a factorized embedding space, where shape composition and decomposition become simple linear operations on the embedding coordinates, allowing both shape reconstruction and part exchange. While that work did not target generative diversity, the network of Li et al. [27] also combines part generation with assembly. Their results reinforce our premise that shape generation using part synthesis and composition does improve diversity, which is measured using inception scores in their work.
3 Method
In this section, we present CompoNet, our generative model, which learns to synthesize shapes that can be represented as a composition of distinct parts. At training time, every shape is pre-segmented into its semantic parts, and we assume that the parts are independent of each other. Thus, every combination of parts is valid, even if the training set does not include it. As shown in Figure 2, CompoNet consists of two units: a generative model of parts and a unit that combines the generated parts into a global shape.
3.1 Part synthesis unit
We first train a generative model that estimates the marginal distribution of each part separately. In the 2D case, we use a standard VAE as the part generative model and train an individual VAE for each semantic part. Thus, each part is fed into a different VAE and is mapped onto a separate latent distribution. The encoder consists of several convolutional layers followed by LeakyReLU activation functions. The final layer of the encoder is a fully-connected layer producing the latent distribution parameters. Using the reparameterization trick, the latent distribution is sampled and decoded to reconstruct each individual input part. The decoder mirrors the encoder network, applying a fully-connected layer followed by transposed convolution layers with ReLU activations. In the 3D case, we borrow an idea from Achlioptas et al. [1] and replace the VAE with an AE+GMM, where we approximate the latent space of the AE using a GMM. The encoder is based on the PointNet [36] architecture, and the decoder consists of fully-connected layers. The part synthesis process is visualized in Figure 2 (part synthesis unit).
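The sampling step inside each part VAE relies on the standard reparameterization trick mentioned above. The following NumPy sketch (function and variable names are ours, not taken from the paper's code) illustrates the mechanism; the actual network operates on framework tensors so that gradients flow through `mu` and `log_var`:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, I).

    The randomness is isolated in eps, so the sample remains a
    differentiable function of the predicted mu and log-variance.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros(8)                 # encoder-predicted mean (hypothetical values)
log_var = np.full(8, -2.0)       # encoder-predicted log-variance
z = reparameterize(mu, log_var, rng)  # latent code fed to the part decoder
```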
Once the part synthesis unit is trained, the part encoders are fixed, and are used to train the part composition unit.
3.2 Part composition unit
This unit composes the different parts into a coherent shape. Given a shape and its parts, where missing parts are represented by null shapes (i.e., zeros), the pre-trained encoders encode the corresponding parts (marked in blue in Figure 2). At training time, these codes are fed into a composition network which learns to produce transformation parameters per part (scale and translation), such that the composition of all the parts forms a coherent complete shape. The loss measures the similarity between the input shape and the composed shape. We use Intersection-over-Union (IoU) as our metric in the 2D domain, and the Chamfer distance in the 3D domain, where the Chamfer distance is given by

$$d_{CH}(P, Q) = \sum_{p \in P} \min_{q \in Q} \| p - q \|_2^2 + \sum_{q \in Q} \min_{p \in P} \| q - p \|_2^2, \quad (1)$$

where P and Q are point clouds which represent the 3D shapes. Note that the composition network yields a set of affine (similarity) transformations, which are applied to the input parts; it does not directly synthesize the output.
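The Chamfer distance used as the 3D composition loss can be computed directly; here is a minimal NumPy sketch (a brute-force O(NM) version for clarity, not the batched, differentiable implementation used during training):

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (N, 3) and Q (M, 3).

    For each point, take the squared distance to its nearest neighbor in the
    other cloud, and sum both directions.
    """
    # Pairwise squared distances, shape (N, M), via broadcasting.
    d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```

For identical clouds the distance is zero; it grows as the two sets occupy increasingly different locations.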
The composition network does not learn the composition based solely on part codes, but also relies on an input noise vector. This network is another generative model on its own, generating the scale and translation from the noise, conditioned on the codes of the semantic parts. This additional generative model enriches the variation of the generated shapes, beyond the generation of the parts.
3.3 Novel shape generation
At inference time, we sample the composition vector from a normal distribution. In the 2D case, since we use VAEs, we sample the part codes from a normal distribution as well. For 3D, we sample the code of each part from its GMM distribution, randomly selecting one of the Gaussians. When generating a new shape with a missing part, we use the embedding of the part’s null vector, and synthesize the shape from that compound latent vector; see Figure 3. We feed each section of the latent vector representing a part to its associated pre-trained decoder from the part synthesis unit, to generate novel parts. In parallel, the entire shape representation vector is fed to the composition network to generate scale and translation parameters for each part. The synthesized parts are then warped according to the generated transformations and combined to form a novel shape.
4 Architecture and implementation details
The backbone architecture of our part-based synthesis is an AE: a VAE for 2D and an AE+GMM for 3D.
4.1 Part-based generation
2D Shapes.
The input parts are assumed to have a size of . We denote () as a 2D convolution (transpose convolution) layer with filters of size and stride , followed by batch normalization and a leaky-ReLU (ReLU) activation. A fully-connected layer with outputs is denoted by . The encoder takes a 2D part as input and has the structure of . The decoder mirrors the encoder as , where in the last layer, , we omitted batch normalization and replaced ReLU by a Sigmoid activation. The output of the decoder is equal in size to the 2D part input, (). We use an Adam optimizer with learning rate , and . The batch size is set to .
3D point clouds.
Our input parts are assumed to have a fixed number of points per part. Different parts can vary in number of points, but this becomes immutable once training has started. We used points per part. We denote as a feature-wise max-pooling layer and as a 1D convolution layer with filters of size and stride , followed by a batch normalization layer and a ReLU activation function. The encoder takes a part with points as input. The encoder structure is . The decoder consists of fully-connected layers. We denote to be a fully-connected layer with output nodes, followed by a batch normalization layer and a ReLU activation function. The decoder takes a latent vector of size as input. The decoder structure is , where in the last layer, , we omitted the batch normalization layer and the ReLU activation function. The output of the decoder is equal in size to the input (). For each AE, we use a GMM with Gaussians to model the latent space distribution. We use an Adam optimizer with learning rate , and . The batch size is set to .
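Drawing a latent code from the per-part GMM at inference time can be sketched as follows (a NumPy sketch with hypothetical parameter names; the paper fits the mixture to the AE latent space, e.g., with standard EM):

```python
import numpy as np

def sample_gmm(weights, means, covs, rng):
    """Draw one latent code from a Gaussian mixture.

    weights: (K,) mixture weights summing to 1
    means:   (K, D) component means
    covs:    (K, D, D) component covariances
    """
    k = rng.choice(len(weights), p=weights)   # randomly pick one Gaussian
    return rng.multivariate_normal(means[k], covs[k])
```

The sampled code is then fed to the corresponding pre-trained part decoder.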
4.2 Part composition
2D.
The composition network encodes each semantic part by the associated pre-trained VAE encoder, producing a dim vector for each part. The composition noise vector is set to be dim. The part codes are concatenated together with the noise, yielding a dim vector. The composition network structure is . Each fully-connected layer is followed by a batch normalization layer, a ReLU activation function, and a Dropout layer with keep rate , except for the last layer. The last layer outputs a dim vector, four values per part. These four values represent the scale and translation in the x and y axes. We use the grid generator and sampler suggested by [22] to perform a differentiable transformation. The scale is initialized to and the translation to . We use a per-part IoU loss and an Adam optimizer with learning rate , and . The batch size is set to .
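The per-part IoU used in the 2D loss compares binary masks; a minimal NumPy version of the metric (the training loss itself uses a differentiable soft variant over network outputs, which this sketch does not capture):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union between two binary masks of equal shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are considered identical
    return np.logical_and(a, b).sum() / union
```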
3D.
The composition network encodes each semantic part by the associated pre-trained AE encoder, producing a dim vector for each part. The composition noise vector is set to size . The part codes are concatenated together with the noise vector, yielding a dim vector. The composition network structure is . Each fully-connected layer is followed by a batch normalization layer and a ReLU activation function, except for the last layer. The last layer outputs a dim vector, six values per part. These six values represent the scale and translation in the x, y, and z axes. The scale is initialized to and the translation to . We then reshape the output vector to match an affine transformation matrix:
$$T = \begin{bmatrix} s_x & 0 & 0 & t_x \\ 0 & s_y & 0 & t_y \\ 0 & 0 & s_z & t_z \end{bmatrix}, \quad (2)$$

where $(s_x, s_y, s_z)$ is the generated per-part scale and $(t_x, t_y, t_z)$ the translation.
Applying an affine transformation to a point cloud is simple: we concatenate a homogeneous coordinate of 1 to each point and multiply each point by the transformation matrix. We use a Chamfer distance loss and an Adam optimizer with learning rate , and . The batch size is set to .
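The homogeneous-coordinate trick described above can be sketched in a few lines of NumPy (helper names are ours; the 3x4 matrix layout follows the scale-and-translation parameterization of the composition network):

```python
import numpy as np

def affine_from_params(scale, trans):
    """Build a 3x4 affine matrix from per-axis scale and translation."""
    T = np.zeros((3, 4))
    T[:, :3] = np.diag(scale)   # (sx, sy, sz) on the diagonal
    T[:, 3] = trans             # (tx, ty, tz) in the last column
    return T

def apply_affine(points, T):
    """Apply a 3x4 affine matrix to an (N, 3) point cloud.

    Each point is lifted to homogeneous coordinates [x, y, z, 1]
    and multiplied by T.
    """
    ones = np.ones((points.shape[0], 1))
    homo = np.concatenate([points, ones], axis=1)  # (N, 4)
    return homo @ T.T                              # (N, 3)
```

For example, scaling by 2 and translating by 1 along x maps the point (1, 2, 3) to (3, 4, 6).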
5 Results and evaluation
In this section, we analyze the results of applying our generative approach to 2D and 3D shape collections.
5.1 Datasets
Projected COSEG.
We used the COSEG dataset [40], which consists of vases segmented into four different semantic labels: top, handle, body, and base (each vase may or may not contain any of these parts). Similar to the projection procedure in [13], each vase is projected from the main view to constitute a collection of 300 silhouettes, where each semantic part is stored in a different channel. In addition, we create four sets, one per part. The parts are normalized by finding their axis-aligned bounding box and stretching it to a fixed resolution.
ShapeNet.
For 3D data, we chose to demonstrate our method on point clouds taken from the ShapeNet part dataset [50]. We chose to focus on two categories: chairs and airplanes. Point clouds, compared to 3D voxels, enable higher resolution while keeping the model complexity relatively low. Similar to the 2D case, each shape is divided into its semantic parts (chair: legs, back, seat, and armrests; airplane: tail, body, engine, and wings). We first normalize each shape to the unit square. We require an equal number of points in each point cloud; thus, we randomly sample each part to a fixed number of points. If a part consists of fewer points, we randomly duplicate some of its points (since our encoder performs only a feature-wise global max pooling, the duplication of points has no effect on the embedding of the shape). This random sampling process occurs every epoch. For consistency between the shape and its parts, we first normalize the original parts to the unit square, and only then sample (or duplicate) the same points that were selected to generate the complete sampled shape.
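The per-part sampling/duplication scheme can be sketched as follows (a NumPy sketch; the exact point budget per part is a hyperparameter):

```python
import numpy as np

def resample_part(points, n, rng):
    """Return exactly n points: subsample if too many, duplicate if too few.

    Duplication is harmless for a global max-pooling encoder, since
    repeated points do not change the feature-wise maximum.
    """
    m = points.shape[0]
    if m >= n:
        idx = rng.choice(m, size=n, replace=False)
    else:
        extra = rng.choice(m, size=n - m, replace=True)
        idx = np.concatenate([np.arange(m), extra])
    return points[idx]
```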
Seen and Unseen splits.
To properly evaluate the diversity of our generative model, we divide the resulting collections into two subsets: (i) a training (seen) set and (ii) an unseen set. The term unseen emphasizes that, unlike the nominal division into train and test sets, the unseen set is not well represented in the training set. Thus, there exists a gap, unbridgeable by a holistic approach, between the unseen set and the training set. To avoid bias during evaluation, we perform several random splits for each seen-unseen split percentage (see Table 1 for the 3D case). In the 2D case, since the dataset is much smaller, we used a fixed split between the training and unseen sets. In both cases, the unseen set is used to evaluate the ability of a model to generate diverse shapes.
5.2 Baselines
For 2D shapes, we use a naive model: a one-channel VAE. Its structure is identical to the part VAE with a latent space of dim. We feed it with a binary representation of the data (a silhouette) as input. We use an Adam optimizer with learning rate , and . The batch size is set to . In the 3D case, we use two baselines: (i) a WGAN-GP [16] for point clouds and (ii) an AE+GMM model [1], which generates remarkable 3D point cloud results. We train the baseline models using our points-per-part data set ( per shape). We use the official implementation and parameters of [1], which also include the WGAN-GP implementation.
5.3 Qualitative evaluation
We evaluate our network, CompoNet, on 2D data and 3D point clouds. Figure 4 shows some generated 3D results. Unlike other naive approaches, we are able to generate versatile samples beyond the empirical distribution. In order to visualize this versatility, we present the nearest neighbors of the generated samples in the training set. As shown in Figure 5, for the 2D case, samples generated by our generative approach differ from the closest training samples. In Figure 6 we also compare this qualitative diversity measure with the baseline [1], showing that our generated samples are more distinct from their nearest neighbors in the training set, compared to those generated by the baseline. In the following sections, we quantify this attribute. More generated results can be found in the supplementary material.
5.4 Quantitative evaluation
We quantify the ability of our model to generate realistic unseen samples using two novel metrics. To evaluate our model, we use 5,000 randomly sampled shapes from our trained model and from the baselines.
Set-coverage.
We define the set-coverage of set A by set B as the percentage of shapes from A which are among the k nearest neighbors of some shape in B. Thus, if set B is similar only to a small part of set A, the set-coverage will be small, and vice versa. In our case, we calculate the nearest neighbors using the Chamfer distance. In Figure 7, we compare the set-coverage of the unseen set and the training set by our generated data and by the data generated by the baseline [1]. It is clear that the baseline covers the training set better, since most of its samples lie close to it. However, the unseen set is covered poorly by the baseline for all k, while our method balances between generating seen and unseen samples.
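A direct implementation of the set-coverage metric can be sketched as follows (a NumPy sketch; `dist` stands for any pairwise shape distance, the Chamfer distance in our case, and the argument names are ours):

```python
import numpy as np

def set_coverage(generated, target, k, dist):
    """Fraction of `target` covered by `generated`: a target shape counts as
    covered if it is among the k nearest neighbors of some generated shape.
    """
    covered = set()
    for g in generated:
        d = np.array([dist(g, t) for t in target])
        for j in np.argsort(d)[:k]:   # indices of the k nearest targets
            covered.add(int(j))
    return len(covered) / len(target)
```

With a toy 1D "shape" distance, a single generated sample covers only its nearest targets, so coverage grows with k.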
Diversity.
We develop a second measure to quantify the generated unseen data, which relies on a classifier trained to distinguish between the training set and the unseen set. We then measure the percentage of generated shapes which are classified as belonging to the unseen set. The classifier architecture is a straightforward adaptation of the encoder from the part synthesis unit, followed by fully-connected layers which classify between the unseen and train sets (see the supplementary file for details).
Table 1 shows classification results for vases, chairs, and airplanes generated by our method and the two baselines. We can observe that when the seen set is a relatively small fraction of the total, our model clearly performs better than the baselines in terms of generative diversity, as exhibited by the higher levels of coverage over the unseen set. However, as the seen set increases in size, the difference between our method and the baselines becomes smaller. We believe that this trend does not reflect that our method starts to generate less diverse samples, but rather that the unseen set becomes more similar to the seen set, hence less diverse itself.
To visualize the coverage of the seen/unseen regions by the generated samples, we use the classifier’s embedding (the layer before the final fully-connected layer) and reduce its dimension by projecting it onto the 2D PCA plane, as shown in Figures 1 and 8. The training and unseen sets overlap in this representation, reflecting data which is similar between the two sets. While both methods are able to generate unseen samples in the overlap region, the baseline samples are biased toward the training set. In contrast, our generated samples are closer to the unseen set.



JSD.
The Jensen-Shannon divergence is a distance measure between two probability distributions, given by

$$JSD(P \,\|\, Q) = \frac{1}{2} D(P \,\|\, M) + \frac{1}{2} D(Q \,\|\, M), \quad M = \frac{1}{2}(P + Q), \quad (3)$$

where P and Q are probability distributions and D is the KL-divergence [26]. Following [1], we define the occupancy probability distribution for a set of point clouds by counting the number of points lying within each voxel in a regular voxel grid. Assuming the point clouds are normalized and axis-aligned, the JSD between two such distributions measures the degree to which two point cloud sets occupy similar locations. Thus, we calculate the occupancy distribution for the unseen set and compare it to the occupancy distributions of samples generated by our method and the baselines. The results are summarized in Table 2 and clearly show that our generated samples are closer in structure to the unseen set. The voxel grid resolution is as in [1], and all point cloud sets are of equal size.
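The JSD between two occupancy distributions can be computed as follows (a NumPy sketch using base-2 logarithms; the choice of logarithm base only rescales the value):

```python
import numpy as np

def kl(p, q):
    """KL-divergence D(p || q) between discrete distributions, skipping
    the zero-probability entries of p (their contribution is zero)."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence between two occupancy distributions.

    With base-2 logs the value is bounded in [0, 1]; the mixture m is
    strictly positive wherever p or q is, so kl() is well defined.
    """
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For identical distributions the JSD is 0; for distributions with disjoint support it reaches 1 (in bits).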
Category    WGAN-GP        AE+GMM [1]     Ours

Chair       0.3 ± 0.02     0.19 ± 0.006   0.02 ± 0.003
Airplanes   0.32 ± 0.016   0.14 ± 0.007   0.07 ± 0.013
6 Conclusion, limitation, and future work
We believe that effective generative models should strive to venture more into the “unseen” data of a target distribution, beyond the observed exemplars from the training set. Covering both the seen and the unseen implies that the generated data is both fit and diverse [48]. Fitness constrains the generated data to be close to data from the target domain, both the seen and the unseen. Diversity ensures that the generated data is not confined only to the seen data.
We have presented a generic approach for “fit-n-diverse” shape modeling based on a part-based prior, where a shape is not viewed as an unstructured whole but as the result of a coherent composition of a set of parts. This is realized by CompoNet, a novel deep generative network composed of a part synthesis unit and a part composition unit. Novel shapes are generated via inference over random samples taken from the latent spaces of shape parts and part compositions. Our work also contributes two novel measures to evaluate generative models: the set-coverage and a diversity measure which quantifies the percentage of generated data classified as “unseen” vs. data from the training set.
Compared to baseline approaches, our generative network demonstrates superior, but still somewhat limited, diversity, since the generative power of the part-based approach is far from fully realized. Foremost, an intrinsic limitation of our composition mechanism is that it is still “in place”: it does not allow changes to part structures or feature transfers between different part classes. For example, enabling a simple symmetric switch in part composition would allow the generation of right-hand images when all the training images are of the left hand.
CompoNet can be directly applied for generative modeling of organic shapes. In terms of plausibility, however, such shapes place a more stringent requirement on coherent and smooth part connections, an issue that our current method does not account for. Perfecting part connections can be a post-process; learning a deep model for this task is worth pursuing as future work. Our current method is also limited by the spatial transformations allowed by the STN during part composition. As a result, we can only deal with man-made shapes without part articulation.
As more immediate future work, we would like to apply our approach to more complex datasets, where parts can be defined during learning. In general, we believe that more research should focus on other generation-related prior information, besides part-based priors. Further down the line, we envision that the fit-n-diverse approach, with generative diversity, will form a baseline for creative modeling [8], potentially allowing part exchanges across different object categories. This may have to involve certain perceptual studies or scores to judge creativity. The compelling challenge is how to define a generative neural network with sufficient diversity to cross the line of being creative [11].
References
 [1] (2017) Learning representations and generative models for 3D point clouds. arXiv preprint arXiv:1707.02392.
 [2] (2017) Generalization and equilibrium in generative adversarial nets (GANs). In Proc. Int. Conf. on Machine Learning, Vol. 70, pp. 224–232.
 [3] (2018) Compositional GAN: learning conditional image composition. arXiv preprint arXiv:1807.07560.
 [4] (2017) BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
 [5] (2010) A connection between partial symmetry and inverse procedural modeling. ACM Transactions on Graphics 29 (4), pp. 104:1–104:10.
 [6] (2011) Probabilistic reasoning for assembly-based 3D modeling. ACM Trans. on Graphics 30 (4), pp. 35:1–35:10.
 [7] (2010) Data-driven suggestions for creativity support in 3D modeling. ACM Trans. on Graphics 29 (6), pp. 183:1–183:10.
 [8] (2016) From inspired modeling to creative modeling. The Visual Computer 32 (1), pp. 1–8.
 [9] (2019) Composite shape modeling via latent space factorization. arXiv preprint arXiv:1901.02968.
 [10] (2016) Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673.
 [11] (2017) CAN: creative adversarial networks, generating "art" by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068.
 [12] (2014) Meta-representation of shape families. ACM Trans. on Graphics 33 (4), pp. 34:1–34:11.
 [13] (2016) Structure-oriented networks of shape collections. ACM Transactions on Graphics 35 (6), Article 171.
 [14] (2004) Modeling by example. ACM Trans. on Graphics 23 (3), pp. 652–663.
 [15] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
 [16] (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS), pp. 5769–5779.
 [17] (2017) Multi-generator generative adversarial nets. arXiv preprint arXiv:1708.02556.
 [18] (1984) Parts of recognition. Cognition, pp. 65–96.
 [19] (2015) Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. In Computer Graphics Forum, Vol. 34, pp. 25–38.
 [20] (2017) Globally and locally consistent image completion. ACM Trans. on Graphics 36 (4), pp. 107:1–107:14.
 [21] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
 [22] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
 [23] (2012) A probabilistic model for component-based shape synthesis. ACM Trans. on Graphics 31 (4), pp. 55:1–55:11.
 [24] (2013) Learning part-based templates from large collections of 3D shapes. ACM Trans. on Graphics 32 (4), pp. 70:1–70:12.
 [25] (2014) Auto-encoding variational Bayes. In Proc. Int. Conf. on Learning Representations.
 [26] (1951) On information and sufficiency. Ann. Math. Statist. 22 (1), pp. 79–86.
 [27] (2019) Learning part generation and assembly for structure-aware shape synthesis. arXiv preprint arXiv:1906.06693.
 [28] (2017) GRASS: generative recursive autoencoders for shape structures. ACM Transactions on Graphics 36 (4), Article 52.
 [29] (2018) Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723.
 [30] (2018) ST-GAN: spatial transformer generative adversarial networks for image compositing. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition.
 [31] (2017) Least squares generative adversarial networks. In Proc. IEEE Int. Conf. on Computer Vision (ICCV), pp. 2813–2821.
 [32] (2013) Structure-aware shape processing. In SIGGRAPH Asia 2013 Courses, pp. 1:1–1:20.
 [33] (2013) Structure-aware shape processing. Computer Graphics Forum (Eurographics State-of-the-art Report), pp. 175–197.
 [34] (2017) The shape variational autoencoder: a deep generative model of part-segmented 3D objects. Computer Graphics Forum 36 (5), pp. 1–12.
 [35] (2016) Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583.
 [36] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
 [37] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
 [38] (2016) Neurally-guided procedural models: amortized inference for procedural graphics programs using neural networks. In NIPS, pp. 622–630.
 [39] (2018) CSGNet: neural shape parser for constructive solid geometry. In CVPR.
 [40] (2011) Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. ACM Trans. on Graphics 30 (6), Article 126.
 [41] (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
 [42] (2012) Learning design patterns with Bayesian grammar induction. In Proc. ACM Symp. on User Interface Software and Technology, pp. 63–74.
 [43] (1992) On growth and form. Dover reprint of the 1942 2nd ed.
 [44] (2018) Global-to-local generative model for 3D shapes. ACM Transactions on Graphics 37 (6).
 [45] (2016) Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pp. 318–335.
 [46] (2011) Symmetry hierarchy of man-made objects. Computer Graphics Forum.
 [47] (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems (NIPS), pp. 82–90.
 [48] (2012) Fit and diverse: set evolution for inspiring 3D shape galleries. ACM Trans. on Graphics 31 (4), pp. 57:1–57:10.
 [49] (2016) Attribute2Image: conditional image generation from visual attributes. In Proc. Euro. Conf. on Computer Vision, pp. 776–791.
 [50] (2016) A scalable active framework for region annotation in 3D shape collections. ACM Trans. on Graphics 35 (6), pp. 210:1–210:12.
 [51] (2018) Generative image inpainting with contextual attention. In Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pp. 5505–5514.
 [52] (2015) Procedural modeling using autoencoder networks. In ACM UIST, pp. 109–118.
 [53] (2018) SCORES: shape composition with recursive substructure priors. ACM Transactions on Graphics 37 (6).
 [54] (2016) Generative visual manipulation on the natural image manifold. In Proc. Euro. Conf. on Computer Vision, pp. 597–613.
Appendix A Supplementary Material
A.1 Network Architectures
Part synthesis
The architectures of the part synthesis generative autoencoders, for the 3D and 2D cases, are listed in Table 3 and Table 4, respectively. We used the following standard hyperparameters to train the 3D (2D) model: Adam optimizer, , , learning rate , batch size .
Operation  Kernel  Strides  Feature maps  Act. func. 
Encode: 400x3 point cloud → 64-dim feature vector
1D conv.  1x64  1  400x64  Relu 
1D conv.  1x64  1  400x64  Relu 
1D conv.  1x64  1  400x64  Relu 
1D conv.  1x128  1  400x128  Relu 
1D conv.  1x64  1  400x64  Relu 
Max Pooling  –  –  128  – 
Decode: 64-dim feature vector → 400x3 point cloud
Linear  –  –  256  Relu 
Linear  –  –  256  Relu 
Linear  –  –  400x3  – 
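To make the Table 3 listing concrete, here is a minimal NumPy sketch of the forward pass, with random weights standing in for the learned ones. We read the 1xK kernels with stride 1 as per-point (1x1) convolutions, i.e. a linear layer shared across all 400 points in the PointNet [36] style, followed by a max pool over points; the final feature size follows the 64-dim code stated in the table header. This is an illustrative reconstruction, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def shared_mlp(points, dims):
    # 1x1 1-D convolutions act as a linear layer shared across points
    h = points
    for d in dims:
        W = rng.standard_normal((h.shape[-1], d)) * 0.1
        h = relu(h @ W)
    return h

def encode(points):
    # Table 3 encoder: per-point features 64-64-64-128-64, then max pool
    h = shared_mlp(points, [64, 64, 64, 128, 64])
    return h.max(axis=0)            # 64-dim global feature vector

def decode(code):
    # Table 3 decoder: 256-256 MLP, then a linear map to 400x3 coordinates
    h = relu(code @ (rng.standard_normal((64, 256)) * 0.1))
    h = relu(h @ (rng.standard_normal((256, 256)) * 0.1))
    out = h @ (rng.standard_normal((256, 400 * 3)) * 0.1)
    return out.reshape(400, 3)

part = rng.standard_normal((400, 3))  # one part as a 400-point cloud
z = encode(part)
recon = decode(z)
```

The per-point weight sharing is what makes the encoder invariant to point ordering: only the max pool aggregates across points.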
Operation  Kernel  Strides  Feature maps  Act. func. 

Encode: 64x64x1 input shape → 10-dim feature vector
Conv.  5x5x8  2x2  32x32x8  lRelu 
Conv.  5x5x16  2x2  16x16x16  lRelu 
Conv.  5x5x32  2x2  8x8x32  lRelu 
Conv.  5x5x64  2x2  4x4x64  lRelu 
2xLinear  –  –  10  – 
Decode: 10-dim feature vector → 64x64x1 shape
Linear  –  –  1024 (=4x4x64)  Relu 
Trans. conv.  5x5x32  2x2  8x8x32  Relu 
Trans. conv.  5x5x16  2x2  16x16x16  Relu 
Trans. conv.  5x5x8  2x2  32x32x8  Relu 
Trans. conv.  5x5x1  2x2  64x64x1  Sigmoid 
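The 2D encoder in Table 4 halves the spatial resolution at each stride-2 convolution (64 → 32 → 16 → 8 → 4), so the flattened feature size matches the 1024 (= 4x4x64) units of the first decoder linear layer. We read the "2xLinear" row as the two heads of a variational autoencoder, producing the mean and log-variance of the 10-dim code. A short NumPy sketch of this shape bookkeeping and the reparameterization step, under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv_out(size, stride=2, n_layers=4):
    # spatial size after the four stride-2 5x5 convolutions ("same" padding)
    for _ in range(n_layers):
        size = size // stride
    return size

spatial = conv_out(64)           # 64 -> 32 -> 16 -> 8 -> 4
flat = spatial * spatial * 64    # 1024, matching the first decoder linear

def reparameterize(mu, log_var):
    # sample z = mu + sigma * eps, keeping the sampling differentiable in
    # (mu, log_var); eps is standard normal noise
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

mu = np.zeros(10)       # stand-ins for the two linear heads' outputs
log_var = np.zeros(10)
z = reparameterize(mu, log_var)
```

The transposed convolutions of the decoder then invert the same doubling pattern back to 64x64x1, ending in a sigmoid to produce an occupancy-style image.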
Parts composition
The architectures of the parts composition units are listed in Table 5 and Table 6, for the 3D and 2D cases respectively. We used the following standard hyperparameters to train the 3D and 2D models: Adam optimizer, , , learning rate , batch size .
Operation  Feature maps  Act. func. 

Comp. net: 64xC+16 feature vector → 6xC comp. vector
Linear  256  Relu 
Linear  128  Relu 
Linear  6xC  – 
Operation  Feature maps  Act. func. 

Comp. net: 10xC+8 feature vector → 4xC comp. vector
Linear  128  Relu 
Linear  128  Relu 
Linear  4xC  – 
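A minimal NumPy sketch of the Table 5 (3D) composition unit, assuming C parts: the input concatenates C 64-dim part codes with a 16-dim noise vector, and the output supplies 6 values per part, which we take to parameterize each part's placement (e.g. a per-axis translation and scale, in the spatial-transformer spirit of §4.2). Random weights again stand in for learned ones; this is an illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

C = 4                                 # number of semantic parts (e.g. chair)
x = rng.standard_normal(64 * C + 16)  # C part codes + 16-dim noise vector

def composition_net(x, dims=(256, 128), out_dim=6 * C):
    # Table 5: Linear(256) -> Linear(128) -> Linear(6xC), ReLU in between
    h = x
    for d in dims:
        h = relu(h @ (rng.standard_normal((h.shape[-1], d)) * 0.1))
    return h @ (rng.standard_normal((h.shape[-1], out_dim)) * 0.1)

t = composition_net(x)                # 6 transformation params per part
per_part = t.reshape(C, 6)
```

The 2D variant in Table 6 follows the same pattern with 10-dim part codes, an 8-dim noise vector, and 4 output values per part.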
A.2 More results
In this section, we present additional results of CompoNet for the three categories. For the Chair and Airplane categories, we randomly sampled shapes from the set we generated for the quantitative metrics; see Figure 9 and Figure 10, respectively. For the Vases, since the generated set is smaller, we sampled shapes from the set we generated for the quantitative metrics; see Figure 11. Furthermore, we present additional interpolation results from CompoNet on the 3D categories; see Figure 12 and Figure 14 for linear interpolations, and Figure 13 and Figure 15 for part-by-part interpolations.
A.3 Comparison to baseline
In this section, we present randomly selected shapes generated by CompoNet and by the baseline on the Chair and Airplane categories. For each generated shape, we show its three nearest neighbors in the training set under the Chamfer distance. We present the results side by side to emphasize the ability of CompoNet to generate novel shapes relative to the baseline; see Figure 16 for chairs and Figure 17 for airplanes.
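The nearest-neighbor retrieval used for this comparison can be sketched as follows, assuming each shape is represented as a fixed-size point cloud; `chamfer` and `nearest_neighbors` are illustrative helpers written for this sketch, not the authors' implementation.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point clouds a:(N,3) and b:(M,3)."""
    # pairwise Euclidean distances between every point in a and every in b
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # for each point, distance to its closest point in the other cloud
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def nearest_neighbors(query, dataset, k=3):
    """Indices of the k dataset shapes closest to query under Chamfer."""
    dists = [chamfer(query, s) for s in dataset]
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
clouds = [rng.standard_normal((100, 3)) for _ in range(5)]
query = clouds[2] + 0.01 * rng.standard_normal((100, 3))  # near-copy of #2
nn = nearest_neighbors(query, clouds, k=3)
```

A generated shape whose Chamfer distance to its first nearest neighbor is large is, by this measure, far from the training set, which is the notion of novelty the side-by-side figures are meant to convey.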