Composite Shape Modeling via Latent Space Factorization
Abstract
We present a novel neural network architecture, termed Decomposer-Composer, for semantic structure-aware 3D shape modeling. Our method utilizes an autoencoder-based pipeline, and produces a novel factorized shape embedding space, where the semantic structure of the shape collection translates into a data-dependent subspace factorization, and where shape composition and decomposition become simple linear operations on the embedding coordinates. We further propose to model shape assembly using an explicit learned part deformation module, which utilizes a 3D spatial transformer network to perform in-network volumetric grid deformation, and which allows us to train the whole system end-to-end. The resulting network allows us to perform part-level shape manipulation unattainable by existing approaches. Our extensive ablation study, comparison to baseline methods and qualitative analysis demonstrate the improved performance of the proposed method.
1 Introduction
Understanding, modeling and manipulating 3D objects are areas of great interest to the vision and graphics communities, and have been gaining increasing popularity in recent years. Examples of related applications include semantic segmentation [41], shape synthesis [37, 2], 3D reconstruction [7, 8] and view synthesis [39], to name a few. The advancement of deep learning techniques and the creation of large-scale 3D shape datasets [5] have enabled researchers to learn task-specific representations directly from existing data, and have led to significant progress in all the aforementioned areas.
There is a growing interest in learning shape modeling and synthesis in a structure-aware manner, for instance, at the level of semantic shape parts. This poses several challenges compared to approaches that consider shapes as a whole. Semantic shape structure and shape part geometry are usually interdependent, and the relations between the two must be implicitly or explicitly modeled and learned by the system. Examples of such structure-aware shape representation learning include [22, 18, 35, 38].
However, the existing approaches for shape modeling, while being part-aware at the intermediate stages of the system, still ultimately operate on low-dimensional representations of the whole shape. For example, [22] uses a Variational Autoencoder (VAE) [15] to learn a generative part-aware model of man-made shapes, but the encoding space of the VAE corresponds to complete shapes. As a result, factors corresponding to different parts are entangled in that space. Thus, existing approaches cannot be utilized to perform direct part-level shape manipulation, such as single part replacement, part interpolation, or part-level shape synthesis.
Inspired by recent efforts in image modeling to separate different image formation factors, in order to gain better control over the image generation process and to simplify editing tasks [27, 32, 33], we propose a new semantic structure-aware shape modeling system. This system utilizes an autoencoder-based pipeline, and produces a factorized embedding space which both reflects the semantic part structure of the shapes in the dataset and compactly encodes the geometry of the different semantic parts. In this embedding space, the embedding coordinates of different semantic parts lie in separate linear subspaces, and shape composition can naturally be performed by summing part embedding coordinates. The embedding space factorization is data-dependent, and is performed using learned linear projection operators. Furthermore, the proposed system operates on unlabeled input shapes, and at test time it simultaneously infers the shape's semantic structure and compactly encodes its geometry.
Towards that end, we propose a Decomposer-Composer pipeline, schematically illustrated in Figure 1. The Decomposer maps an input shape, represented by an occupancy grid, into the factorized embedding space described above. The Composer reconstructs a shape with semantic part labels from a set of part-embedding coordinates. It explicitly learns the set of transformations to be applied to the parts, so that together they form a semantically and geometrically plausible shape. In order to learn and apply those part transformations, we employ a 3D variant of the Spatial Transformer Network (STN) [11]. A 3D STN was previously utilized to scale and translate objects represented as 3D occupancy grids in [10], but to the best of our knowledge, ours is the first approach to suggest in-network affine deformation of occupancy grids.
Finally, to promote part-based shape manipulation, such as part replacement, part interpolation, or shape synthesis from arbitrary parts, we employ a cycle consistency constraint [42, 27, 23, 34]. We utilize the fact that the Decomposer maps input shapes into a factorized embedding space, making it possible to control which parts are passed to the Composer for reconstruction. Given a batch of input shapes, we apply our Decomposer-Composer network twice, randomly mixing part embedding coordinates before the first Composer application, and then demixing them into their original positions before the second Composer application. The resulting shapes are required to be as similar as possible to the original shapes, which is enforced using a cycle consistency loss.
Main contributions
Our main contributions are: (1) a novel latent space factorization approach which enables shape structure manipulation using linear operations directly in the learned latent space; (2) the application of a 3D STN to perform in-network affine shape deformation, for end-to-end training and improved reconstruction accuracy; and (3) the incorporation of a cycle consistency loss for improved reconstruction quality.
2 Related work
Learning-based shape synthesis
Learning-based methods have been used for automatic synthesis of shapes from complex real-world domains. In a seminal work [12], Kalogerakis et al. used a probabilistic model, which learned both continuous geometric features and discrete component structure, for component-based shape synthesis and novel shape generation. More recently, the development of deep neural networks has made it easier to learn high-dimensional features: 3D-GAN [37] uses 3D decoders and a GAN to generate voxelized shapes, and a similar approach has been applied to 3D point clouds, achieving high fidelity and diversity in shape synthesis [2]. Apart from generating shapes from an unstructured latent representation, some methods generate shapes from a latent representation with structure. SSGAN [36] generates the shape and texture for a 3D scene in a two-stage manner. GRASS [18] generates shapes in two stages: first generating oriented bounding boxes, and then a detailed geometry within those bounding boxes. Nash and Williams [22] use a VAE, and generate parts by dividing the latent code equally into predefined segments representing different parts. In a related work [35], Wang et al. introduced a 3D GAN-based generative model which produces shapes segmented and labeled into parts. Unlike the two latter approaches, our network does not use predefined subspaces for part embedding, but learns to project the latent code of the entire shape onto the subspaces corresponding to codes of different parts.
Spatial transformer networks
Spatial transformer networks (STNs) [11] make it easy to incorporate deformations into a learning pipeline. Kurenkov et al. [16] retrieve a 3D model from a single RGB image and generate a deformation field to modify it. Kanazawa et al. [13] model articulated or soft objects with a template shape and deformations. Lin et al. [19] iteratively use STNs to warp a foreground onto a background, and use a GAN to constrain the composition results to the natural image manifold. Hu et al. [10] use a 3D STN to scale and translate objects given as volumetric grids, as part of a scene generation network. Inspired by this line of work, we incorporate an affine transformation module into our network. This way, the generation module only needs to generate normalized parts, and the deformation module transforms and assembles the parts together.
Deep latent space factorization
Several approaches have been suggested to learn disentangled latent spaces for image representation and manipulation. β-VAE [9] introduces an adjustable hyperparameter that balances latent channel capacity and independence constraints against reconstruction accuracy. InfoGAN [6] achieves disentanglement of factors by maximizing the mutual information between certain channels of the latent code and image labels. Some approaches disentangle the image generation process using intrinsic decomposition, such as albedo and shading [33], or a normalized shape and deformation grid [27, 32]. Note that the proposed approach differs from [27, 32, 33] in that both full and partial shape embedding coordinates reside in the same low-dimensional embedding space, while in the latter, different components have their own separate embedding spaces.
Projection in neural networks
Projection is widely used in representation learning. It can be used for transformation from one domain to another [3, 25, 26], which is useful for tasks like translation in natural language processing. For example, Senel et al. [30] use projections to map word vectors into semantic categories, to analyze the semantic structure of word embeddings. In this work, we use a projection layer to transform a whole-shape embedding into semantic part embeddings.
3 Our model
3.1 Decomposer network
The Decomposer network is trained to embed unlabeled shapes (naturally built of a set of semantic parts) into a factorized embedding space, reflecting the shared semantic structure of the shape collection. To allow for composite shape synthesis, the embedding space has to satisfy two properties: factorization consistency across input shapes, and the existence of a simple shape composition operator that combines different semantic factors. We propose to model this embedding space $V \subseteq \mathbb{R}^d$ as a direct sum of subspaces $V = V_1 \oplus \dots \oplus V_K$, where $K$ is the number of semantic parts, and each subspace $V_i$ corresponds to the $i$-th semantic part, thus satisfying the factorization consistency property. The second property is ensured by the fact that every $v \in V$ is given by a sum of unique $v_i \in V_i$, so that part composition may be performed by part-embedding summation. This also implies that the decomposition and composition operations in the embedding space are fully reversible.
A simple approach to such a factorization is to split the dimensions of the $d$-dimensional embedding space into $K$ coordinate groups, each group representing a certain semantic part embedding. In this case, the full shape embedding is a concatenation of part embeddings, an approach explored in [35]. This, however, puts a hard constraint on the dimensionality of the part embeddings, and thus also on the representation capacity of each part-embedding subspace. Given that different semantic parts may have different geometric complexities, this factorization may be suboptimal.
Instead, we propose performing a data-driven, learned factorization of the embedding space into semantic subspaces. We achieve this by performing the factorization using learned part-specific projection matrices, denoted by $P_1, \dots, P_K \in \mathbb{R}^{d \times d}$. To ensure that the aforementioned two factorization properties hold, and that $V$ is factorized into subspaces $\{V_i\}_{i=1}^{K}$ such that $V = V_1 \oplus \dots \oplus V_K$ (the direct-sum property), these projection matrices must form a partition of the identity. Namely, the $P_i$ must satisfy the following three properties:
$$P_i^2 = P_i, \qquad P_i P_j = 0_{d \times d} \;\; \text{for } i \neq j, \qquad \sum_{i=1}^{K} P_i = I_d \qquad (1)$$
where $0_{d \times d}$ and $I_d$ are the all-zero and the identity matrices of size $d \times d$, respectively.
In practice, we efficiently implement the projection operators using fully connected layers without added biases, with a total of $K d^2$ variables, constrained as in Equation 1. The projection layers receive as input a whole-shape encoding, which is produced by a 3D convolutional shape encoder. The parameters of the shape encoder and the projection layers are learned simultaneously. The resulting architecture of the Decomposer network is schematically described in Figure 2, and a detailed description of the shape encoder and the projection layer architecture is given in the supplementary material.
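As a minimal illustration, the learned factorization amounts to multiplying a whole-shape encoding by $K$ part-specific matrices, with a soft penalty on the sum-to-identity property of Equation 1. The following NumPy sketch uses hypothetical names and exact coordinate-subspace matrices as placeholders for the learned ones; it is not the paper's TensorFlow implementation.

```python
import numpy as np

def project_parts(z, P):
    """Split a whole-shape encoding z (d,) into K part encodings z_i = P_i z.

    P has shape (K, d, d); in the real system these matrices are learned,
    here they are fixed placeholders.
    """
    return np.einsum('kij,j->ki', P, z)

def partition_of_identity_penalty(P):
    """Soft penalty encouraging sum_i P_i = I (one of the Eq. 1 properties)."""
    d = P.shape[-1]
    residual = P.sum(axis=0) - np.eye(d)
    return np.sum(residual ** 2)  # squared Frobenius norm

# Toy example: K=2 exact coordinate-subspace projections on a d=4 space.
d, K = 4, 2
P = np.zeros((K, d, d))
P[0, :2, :2] = np.eye(2)   # part 0 owns dimensions 0-1
P[1, 2:, 2:] = np.eye(2)   # part 1 owns dimensions 2-3

z = np.array([1.0, 2.0, 3.0, 4.0])
parts = project_parts(z, P)

# Composition is summation of part encodings, and is exactly reversible here.
assert np.allclose(parts.sum(axis=0), z)
assert partition_of_identity_penalty(P) == 0.0
```

When the matrices form a partition of the identity, composition by plain summation of part encodings is exactly invertible, which is the property the learned projections approximate.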
3.2 Composer network
The Composer network is trained to reconstruct shapes from sets of semantic part embedding coordinates. We assume that these part embedding sets are valid, in the sense that each set includes at most one embedding coordinate per semantic part type. The Composer produces an output shape labeled with semantic part labels.
The simplest composer implementation would consist of a single decoder mirroring the (whole-)shape encoder described in the previous section, producing an output shape with or without semantic labels. Such an approach was used in [35], for instance. However, this straightforward method fails to reconstruct thin volumetric shape parts, e.g., thin chair legs, and other fine shape details. To address this issue, we use a different approach: we first separately reconstruct scaled and centered shape parts, using a shared part decoder; we then produce per-part transformation parameters and use them to deform the parts in a coherent manner, to obtain a complete reconstructed shape.
In our model, we make the simplifying assumption that it is possible to combine a given set of parts into a plausible shape by transforming them with per-part affine transformations and translations. While the true set of transformations which produce plausible shapes is significantly larger and more complex, we demonstrate that the proposed simplified model is successful at producing geometrically and visually plausible results. This in-network part transformation is implemented using a 3D spatial transformer network (STN) [11]. It consists of a localization net, which produces a set of 12-dimensional affine transformations (including translations) for all parts, and a resampling unit, which transforms the reconstructed scaled and centered part volumes and places the parts in their correct locations in the full shape. The STN receives as input both the reconstructed parts from the part decoder and the sum of the part encodings, for best reconstruction results.
The resulting Composer architecture consists of two components: a shared part decoder, which receives part embedding coordinates and produces centered and scaled (to the unit cube) versions of the parts, and a spatial transformer network (STN), which deforms the parts and places them in the full assembled shape. The Composer architecture is schematically described in Figure 2; its detailed description is given in the supplementary material.
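To make the resampling step concrete, the NumPy sketch below applies a 12-parameter affine transform (a 3×4 matrix) to an occupancy grid by inverse warping. All names are hypothetical; the actual STN uses differentiable trilinear sampling, while this illustration uses nearest-neighbor sampling for brevity.

```python
import numpy as np

def warp_voxels(vol, theta):
    """Resample a voxel grid under a 12-parameter affine transform.

    vol:   (D, D, D) occupancy grid.
    theta: (3, 4) matrix mapping output coordinates to input coordinates
           (inverse warping), in voxel units. Nearest-neighbor sampling is
           used here for brevity; a differentiable STN would use trilinear.
    """
    D = vol.shape[0]
    out = np.zeros_like(vol)
    zz, yy, xx = np.meshgrid(np.arange(D), np.arange(D), np.arange(D),
                             indexing='ij')
    coords = np.stack([zz.ravel(), yy.ravel(), xx.ravel(),
                       np.ones(D ** 3)])          # homogeneous (4, D^3)
    src = np.rint(theta @ coords).astype(int)     # (3, D^3) source voxels
    valid = np.all((src >= 0) & (src < D), axis=0)
    out[zz.ravel()[valid], yy.ravel()[valid], xx.ravel()[valid]] = \
        vol[src[0, valid], src[1, valid], src[2, valid]]
    return out

# Translate a single occupied voxel by (+1, 0, 0).
vol = np.zeros((8, 8, 8))
vol[3, 3, 3] = 1.0
theta = np.hstack([np.eye(3), [[-1.0], [0.0], [0.0]]])  # inverse of a +1 shift
shifted = warp_voxels(vol, theta)
assert shifted[4, 3, 3] == 1.0 and shifted.sum() == 1.0
```

Inverse warping (looking up a source voxel for every output voxel) is what makes the operation well defined for arbitrary affine maps, including scaling and rotation.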
We note that the proposed approach is related to the twostage shape synthesis approach of [18], in which a GAN is first used to synthesize oriented bounding boxes for different parts, and then the part geometry is created per bounding box using a separate part decoder. Our approach is similar, yet it works in a reversed order. Namely, we first reconstruct part geometry with our shared part decoders, and then compute perpart affine transformation parameters, which are a 12dimensional equivalent of the oriented part bounding boxes in [18]. Similarly to [18], this two stage approach improves the reconstruction of fine geometric details. However, unlike [18], where the GAN and the part decoder where trained separately, in our approach the two stages belong to the same reconstruction pipeline, coupled by the full model reconstruction loss, and trained simultaneously and endtoend. On the other hand, compared to [18], our approach is limited in the sense that it reconstructs shapes from semantic parts, while [18] synthesizes shape from an arbitrary number of smaller parts, and is trained with more complex symmetrypreserving requirements. We plan to address this limitation in future work.
3.3 Cycle consistency
Our training set is comprised of 3D shapes with ground-truth semantic part decomposition; it does not include any training examples of synthesized composite shapes. In fact, existing methods for such a shape assembly task operate on 3D meshes with very precise segmentations, and often with additional knowledge about part connectivity [40, 31]. These methods cannot be applied to a dataset like ours, which comes without precise segmentations or extra knowledge about part connectivity. As a result, existing methods cannot be used to produce a sufficiently large set of plausible new shapes (constructed from existing parts) for training a deep network for composite shape modelling. Instead, we add a cycle consistency requirement, training the network to produce non-trivial, geometrically and semantically plausible part transformations for arbitrary part arrangements.
Cycle consistency has been previously utilized in geometry processing [23], image segmentation [34], and more recently in neural image transformation [27, 42]. We use it as follows: given a batch of $n$ training examples, the Decomposer produces $n$ sets of corresponding semantic part encodings, each with $K$ encodings. During training, we randomly mix part encodings of different shapes in the batch, while ensuring that after the mixing each of the new encoding sets is valid (i.e., includes exactly one embedding coordinate per semantic part). After that, we pass the new sets of mixed encodings to the Composer, which reconstructs shapes with correspondingly mixed parts. We then pass those shapes (as binary occupancy grids) to the Decomposer a second time, once again producing $n$ sets of part encodings. We demix the encodings, to restore the original encoding-to-shape association, and pass the demixed encoding sets to the Composer again. The cycle consistency requirement means that the results produced by the second Composer application must be as similar as possible to the original shapes, which is enforced by the cycle consistency loss described in the next section. The double application of the proposed network with part encoding mixing and demixing is schematically described in Figure 3.
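The mixing and demixing steps can be sketched as per-part permutations of the batch. This is a NumPy illustration with hypothetical names, not the paper's implementation; note that shuffling each part slot independently keeps every mixed set valid (exactly one encoding per semantic part).

```python
import numpy as np

def mix_part_codes(codes, rng):
    """Shuffle part encodings across a batch, independently per part slot.

    codes: (n, K, d) array of part encodings for a batch of n shapes.
    Returns the mixed codes and the per-part permutations needed to demix.
    """
    n, K, _ = codes.shape
    perms = np.stack([rng.permutation(n) for _ in range(K)])  # (K, n)
    mixed = np.stack([codes[perms[k], k] for k in range(K)], axis=1)
    return mixed, perms

def demix_part_codes(mixed, perms):
    """Restore the original encoding-to-shape association."""
    demixed = np.empty_like(mixed)
    for k in range(mixed.shape[1]):
        demixed[perms[k], k] = mixed[:, k]
    return demixed

rng = np.random.default_rng(0)
codes = rng.normal(size=(4, 3, 5))       # batch of 4 shapes, 3 parts, d=5
mixed, perms = mix_part_codes(codes, rng)

# Demixing is exact, so the cycle loss compares reconstructions of the
# same part arrangements as the original shapes.
assert np.allclose(demix_part_codes(mixed, perms), codes)
```

In the full pipeline, the Composer and Decomposer are applied between the mixing and demixing steps, so the recovered codes are only approximately equal to the originals, and the cycle loss penalizes the discrepancy.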
3.4 Loss function
Our loss function is defined as the following weighted sum of several loss terms:

$$L = w_1 L_{PI} + w_2 L_{part} + w_3 L_{trans} + w_4 L_{rec} + w_5 L_{cycle} \qquad (2)$$

The weights $w_1, \dots, w_5$ compensate for the different scales of the loss terms, and reflect their relative importance.
Partition of the identity loss
$L_{PI}$ measures the deviation of the predicted projection matrices from a partition of the identity, as specified by Equation 1:

$$L_{PI} = \sum_{i \neq j} \left\| P_i P_j \right\|_F^2 + \Big\| \sum_{i=1}^{K} P_i - I_d \Big\|_F^2 \qquad (3)$$
Part reconstruction loss
$L_{part}$ is the binary cross-entropy loss between the reconstructed, centered and scaled part volumes and their respective ground-truth part indicator volumes, summed over parts.
Transformation parameter loss
$L_{trans}$ is a regression loss between the predicted and the ground-truth 12-dimensional transformation parameter vectors, summed over parts. Unlike the original STN approach [11], where there was no direct supervision on the transformation parameters, we found that this supervision is critical for the convergence of our network. We provide more details on the exact training procedure in Section 3.5.
Whole model reconstruction loss
$L_{rec}$ measures the complete shape reconstruction quality, and is given by the cross-entropy loss between the resulting volume with predicted part labels and the ground-truth labeled volume.
Cycle consistency loss
$L_{cycle}$ measures the deviation between the ground-truth input volumes and their (binarized) reconstructions, obtained by applying the proposed network twice, with part encoding mixing and demixing between the first and second composing steps, as described in Section 3.3. We measure this deviation using a binary cross-entropy loss.
3.5 Training details
The network was implemented in TensorFlow [1], and trained for 700 epochs with batch size 48. We used the Adam optimizer [14] with learning rate , decay rate of , and decay step size of 300 epochs. We found it essential to first pretrain the binary shape encoder, projection layer and part decoder parameters separately for  epochs, by minimizing the part reconstruction and the partition of the identity losses, for improved part reconstruction results. We then train the parameters of the spatial transformer network for another  epochs, using the transformation parameter loss, while keeping the rest of the parameters fixed. After that, we resume training with all parameters, and turn on the full model and cycle consistency losses after additional  epochs, to fine-tune the reconstruction parameters. The total training time is one day on an NVIDIA Tesla V100 GPU. The optimal loss combination weights were empirically determined using the validation set.
4 Experiments
Dataset
In our experiments, we used the chair models from the ShapeNet 3D data collection [5], with part annotations produced by Yi et al. [41]. The shapes were converted to occupancy grids using binvox [24]. Semantic part labels were first assigned to the occupied voxels according to their proximity to the labeled 3D points, and the final voxel labels were obtained using graph cuts in the voxel domain [4]. We used the official ShapeNet train, validation and test data splits in all our experiments.
Data augmentation
We perform two types of data augmentation, which we found important for successful training. First, during training we randomly and independently remove parts from the training shapes, with a fixed probability per part. Thus, our model learns to reconstruct not only complete shapes, but also shapes consisting of a subset of parts. As will be illustrated by the ablation study, this type of data augmentation is important to keep the network from overfitting to the training data, and to achieve high-quality reconstruction.
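The first augmentation can be sketched as follows. This is a hypothetical NumPy illustration: the per-part drop probability is elided in the source, and the safeguard that keeps at least one part is our assumption, not a detail stated in the paper.

```python
import numpy as np

def random_part_removal(part_volumes, p_drop, rng):
    """Randomly drop semantic parts from a training shape.

    part_volumes: (K, D, D, D) per-part indicator volumes.
    Each part is removed independently with probability p_drop; at least
    one part is always kept (our assumption) so the shape never vanishes.
    """
    K = part_volumes.shape[0]
    keep = rng.random(K) >= p_drop
    if not keep.any():
        keep[rng.integers(K)] = True   # safeguard: keep one random part
    return part_volumes * keep[:, None, None, None]

rng = np.random.default_rng(1)
parts = np.ones((4, 8, 8, 8))          # toy shape with K=4 solid parts
augmented = random_part_removal(parts, p_drop=0.5, rng=rng)
kept = [k for k in range(4) if augmented[k].any()]
assert 1 <= len(kept) <= 4
```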
The second type of data augmentation is intended to help the spatial transformer network learn to produce non-trivial per-part affine transformations. Specifically, we augment input parts with random affine transformations and provide the transformed sets of parts separately to the Decomposer as an additional input, while training the spatial transformer network to predict the inverse transformations, i.e., to reconstruct the ground-truth labeled shape from its transformed parts.
4.1 Shape reconstruction
In this experiment, we tested the reconstruction capabilities of the proposed network. Note that for this and the other experiments described below, we used unlabeled shapes from the test set. Labeled ground-truth shapes, when shown, are for illustration and comparison purposes only. Figure 4 presents the input unlabeled shapes (in gray) and the reconstructed shapes composed of semantic parts (color-coded).
4.2 Composite shape synthesis
Shape composition by part exchange
We randomly picked pairs of shapes and exchanged parts between them, by mapping the input unlabeled shapes into the embedding space with the Decomposer, exchanging the encodings of a chosen semantic part, and composing shapes from the new part arrangements with the Composer. The results are shown in Figure 5, and demonstrate the ability of our system to perform accurate part exchange, while deforming the geometry of both the new and the existing parts to obtain a plausible result.
Shape composition by random part assembly
In this experiment, we tested the ability of the proposed network to assemble shapes from random parts using our factorized embedding space. Here, we worked with batches of size four. We mapped the input shapes into the embedding space with the Decomposer; we then randomly mixed the part embedding coordinates as described in Section 3.3, ensuring that no two encodings in the new set came from the same original shape; finally, we composed the new shapes with mixed parts using the Composer. The results are shown in Figure 6, and illustrate the ability of the proposed method to combine parts from different shapes and deform them so that the resulting shape looks realistic.
See the supplementary material for additional results of shape reconstruction, part exchange and assembly from parts, for the chairs and two additional classes of shapes from the ShapeNet (planes and tables).
Full and partial interpolation in the embedding space
In this experiment, we tested reconstruction from linearly interpolated embedding coordinates of complete shapes, as well as from interpolated embedding coordinates of a single semantic part. For the latter, we interpolated between the embedding coordinates of one of the original shape's parts and the corresponding semantic part of another, randomly picked shape, while keeping the rest of the part embedding coordinates fixed. The results are shown in Figure 7. See the supplementary material for a detailed description of the interpolated shape reconstruction process, and for more shape and part interpolation examples.
4.3 Embedding space and projection matrix analysis
Embedding space
Projection matrix analysis
Figure 9 shows the obtained projection matrices, their sum, and a plot of their singular values. The proposed method succeeds in obtaining a set of projection matrices which approximately sum to the identity, and which attain a partition of the identity loss (Equation 3) of the order of one, for a hundred-dimensional embedding space and four semantic subspaces. While the learned $P_i$ are full-rank and not strictly orthogonal projection matrices, the plot of their singular values shows that their effective ranks are significantly lower than the embedding space dimension. This is also in line with the excellent separation into non-overlapping subspaces produced by these projection matrices.
                         mIoU            Connectivity           Classification          Symmetry
                       Rec.  Rec.     Rec.  Swap  Col.      Rec.  Swap  Col.      Rec.  Swap  Col.
Our method             0.63  0.66     0.87  0.85  0.80      0.92  0.74  0.56      0.93  0.93  0.93
Fixed projection       0.61  0.66     0.82  0.79  0.78      0.90  0.65  0.47      0.92  0.93  0.93
Decoder w/o STN        0.77  0.77     0.81  0.65  0.58      0.96  0.72  0.53      0.94  0.92  0.91
W/o data augmentation  0.60  0.66     0.85  0.83  0.79      0.86  0.69  0.51      0.93  0.93  0.93
W/o part removal       0.61  0.55     0.75  0.70  0.68      0.91  0.78  0.69      0.88  0.89  0.89
W/o cycle loss         0.64  0.62     0.77  0.71  0.67      0.89  0.65  0.50      0.93  0.93  0.93
Naive placement         --    --       --   0.68  0.62       --   0.47  0.21       --   0.96  0.96
4.4 Ablation study
To highlight the importance of the different elements of the proposed approach, we conducted an ablation study, in which we used several variants of the proposed method, as listed below, as well as a naive part placement baseline.
Fixed projection matrices
Here, instead of using learned projection matrices in the Decomposer, the $d$-dimensional shape encoding is split into $K$ consecutive, equal-sized segments, which correspond to the different part embedding subspaces. This is equivalent to using constant projection matrices, where the diagonal elements corresponding to a particular segment's dimensions are one, and the rest of the elements are zero.
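This fixed-projection baseline can be written down explicitly. The sketch below (a NumPy illustration with hypothetical names) builds the constant 0/1 block matrices and checks that they form an exact partition of the identity; the trade-off is that every part is forced to use the same number of dimensions.

```python
import numpy as np

def block_projection_matrices(d, K):
    """Constant 0/1 projections that split a d-dim code into K equal segments."""
    assert d % K == 0, "segments must be equal-sized"
    seg = d // K
    P = np.zeros((K, d, d))
    for k in range(K):
        idx = np.arange(k * seg, (k + 1) * seg)
        P[k, idx, idx] = 1.0   # ones on the diagonal of this part's segment
    return P

P = block_projection_matrices(d=8, K=4)

# These constant projections satisfy all three Eq. 1 properties exactly.
assert np.allclose(P.sum(axis=0), np.eye(8))                   # sum to identity
assert all(np.allclose(P[k] @ P[k], P[k]) for k in range(4))   # idempotent
assert np.allclose(P[0] @ P[1], 0)                             # orthogonal
```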
Composer without STN
Here, we substituted the proposed composer, consisting of the part decoder and the STN, with a single decoder producing a labeled shape. The decoder receives the sum of part encodings as input, processes it with two FC layers to combine information from the different parts, and then reconstructs a shape with part labels using a series of deconvolution steps, similar to the part decoder in the proposed architecture.
Without random part removal
Here, we removed the first type of data augmentation, and used only complete shapes for training.
Without affine transformation augmentation
Here, we removed the second type of data augmentation, namely, augmenting the input parts with random affine transformations. Instead, we supplied the original, untransformed parts to the network.
Without cycle loss
Here, we removed the cycle loss component during the network training.
Baseline method
Here, given input parts, we simply placed them in the output volume at their original positions. All the shapes in our dataset are centered and uniformly scaled to fill the unit volume, and there are clusters of similar-looking chairs. Thus, we can expect that even this naive approach, without part transformations, will produce plausible results in some cases.
Evaluation metrics
For the evaluation, we used the following metrics.
mIoU

Mean Intersection over Union (mIoU) [20] is commonly used to evaluate the performance of segmentation algorithms. Here, it is used as a metric for reconstruction quality. We computed the mIoU for both the scaled and centered (when applicable) and the actual-sized reconstructed parts.
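A minimal sketch of the per-part IoU computation on labeled voxel grids follows (a NumPy illustration with hypothetical names; the label convention, 0 for empty space, is an assumption).

```python
import numpy as np

def part_miou(pred, gt, num_parts):
    """Mean IoU over semantic parts for labeled voxel grids.

    pred, gt: integer label volumes; 0 = empty, 1..num_parts = part labels.
    Parts absent from both volumes are skipped.
    """
    ious = []
    for part in range(1, num_parts + 1):
        p, g = pred == part, gt == part
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # part absent from both volumes
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# Toy check: part 1 is half missing (IoU 0.5), part 2 is perfect (IoU 1.0).
gt = np.zeros((4, 4, 4), dtype=int)
gt[:2] = 1
gt[2:] = 2
pred = gt.copy()
pred[0] = 0
assert abs(part_miou(pred, gt, 2) - 0.75) < 1e-9
```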
Connectivity

In part-based shape synthesis, one pathological issue is that parts are often disconnected from each other. Here, we benchmark the quality of part placement in terms of part connectivity. For each volume, we dilate the occupied voxels, to allow for small gaps between parts, and then compute the frequency with which the shape forms a single connected component.
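The connectivity check can be sketched with SciPy's morphology tools (an illustration with hypothetical names; the dilation radius is elided in the source, so the value here is an assumption):

```python
import numpy as np
from scipy import ndimage

def is_connected(vol, dilation=1):
    """Check whether an occupancy grid forms a single connected component.

    The grid is first dilated to tolerate small gaps between parts
    (the dilation amount is an assumption; the source elides the value).
    """
    mask = vol > 0
    if dilation > 0:
        mask = ndimage.binary_dilation(mask, iterations=dilation)
    _, num_components = ndimage.label(mask)
    return num_components == 1

# Two voxels separated by a one-voxel gap: dilation bridges the gap.
vol = np.zeros((8, 8, 8))
vol[1, 1, 1] = 1
vol[1, 1, 3] = 1
assert is_connected(vol, dilation=1)
assert not is_connected(vol, dilation=0)
```

The reported score would then be the fraction of test volumes for which `is_connected` returns true.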
Classification accuracy

We trained a binary neural classifier to estimate the quality of an assembly in terms of the "realisticness" of a given shape. Specifically, this classifier was trained to distinguish between ground-truth whole chairs (acting as positive examples) and chairs produced by naively placing random chair parts together (acting as negative examples). To construct negative examples, we used ground-truth chair parts from arbitrary 'source' chairs, considering the addition of each part type (e.g., legs) at most once, and placing each part at the same location and with the same orientation it had in the source chair from which it was extracted. In addition, we removed negative examples assembled of parts from geometrically and semantically similar chairs, since such part arrangements could produce plausible shapes incorrectly placed in the negative example set. The attained classification accuracy on the test set was . For a given set of chairs, we report the average classification score. Details of the network can be found in the supplementary material.
Symmetry

The chair shapes in ShapeNet are predominantly bilaterally symmetric, with a vertical symmetry plane. Hence, similarly to [35], we evaluate the symmetry of the reconstructed shapes, and define the symmetry score as the percentage of matched voxels (filled or empty) between the reconstructed volume and the same volume reflected with respect to the symmetry plane. We perform this evaluation using binarized reconstruction results, effectively measuring the global symmetry of the shapes.
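The symmetry score reduces to comparing a binarized volume with its mirror image. A minimal NumPy sketch (hypothetical names; the reflection axis is assumed to be a grid axis aligned with the symmetry plane):

```python
import numpy as np

def symmetry_score(vol, axis=0):
    """Fraction of voxels (filled or empty) that match under reflection
    about the grid's central plane orthogonal to `axis`."""
    binary = vol > 0
    reflected = np.flip(binary, axis=axis)
    return float(np.mean(binary == reflected))

# Toy check: a single off-center voxel is asymmetric; adding its mirror
# image makes the volume perfectly symmetric.
vol = np.zeros((4, 4, 4))
vol[0, 0, 0] = 1
vol_sym = vol + np.flip(vol, axis=0)
assert symmetry_score(vol_sym) == 1.0
assert symmetry_score(vol) < 1.0
```

Counting both filled and empty voxels means that nearly empty grids score high by construction, which is why the paper pairs this metric with mIoU and connectivity.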
For evaluation, we used the shapes in the test set (690 shapes), and conducted three types of experiments: full shape encoding and reconstruction; full shape encoding with a single random part exchanged between a pair of random shapes; and shape composition by random part assembly. The experiments are described in more detail in Sections 4.1 and 4.2.
Evaluation results
According to the mIoU and per-part mIoU metrics, the proposed method outperforms all other variants and the baseline, except the variant with the simple shape decoder. This follows from the fact that the proposed system, while better reconstructing fine geometric features, decomposes the problem into two inference problems, for the geometry and for the transformation, and thus does not produce as faithful a reconstruction of the original model as the simple decoder. On the other hand, as illustrated in Figure 10, this allows our method to perform better when constructing a shape from random parts. See the supplementary material for an additional qualitative comparison of the results of the proposed and baseline methods.
In the connectivity test, our method outperformed all the baselines. This shows that the proposed network solves the problem of misplaced, disconnected parts more effectively than the baseline methods. This is also verified by the qualitative comparison in Figure 10 and by additional examples in the supplementary material. Specifically, the improved connectivity results, as compared to the architecture with a decoder without an STN ("Decoder w/o STN"), illustrate the importance of the STN for assembly quality. In the symmetry test, all methods except training without random part removal ("W/o part removal") show comparable results. As expected, the naive placement achieves the highest symmetry score, equal to the score of the original test shapes, since it preserves their symmetry during shape assembly.
Interestingly, methods with better mIoU or classification accuracy, such as the architecture without STN or the variant trained without random part removal ("W/o part removal"), perform quite poorly on the connectivity benchmark and in the qualitative comparison, and the latter also on the symmetry benchmark. Although the proposed method does not achieve the best mIoU and classifier scores, it is usually second best on those benchmarks while producing shapes with better connectivity and symmetry. Overall, the proposed method achieves good performance on all four benchmarks and superior qualitative results, justifying our design choices.
Affine transformation analysis
We performed a statistical analysis of the learned affine transformations. Please refer to the supplementary material for more details.
5 Conclusions and future work
We presented a Decomposer-Composer network for structure-aware 3D shape modeling. It generates a factorized latent shape representation, in which the embedding coordinates of different semantic parts lie in separate linear subspaces. This subspace factorization allows us to manipulate shapes via their part embedding coordinates, exchange parts between shapes, and synthesize novel shapes by assembling them from random parts. Qualitative results show that the proposed system can generate high-fidelity 3D shapes and meaningful part manipulations. Quantitative results show that it is competitive on the mIoU, connectivity, symmetry and classification benchmarks.
While the proposed approach makes a step toward automatic shape-from-parts assembly, it has several limitations. First, while we can generate high-fidelity shapes at a relatively low resolution, memory limitations prevent us from working with higher-resolution voxelized shapes. Memory-efficient architectures, such as OctNet [29] and PointGrid [17], may help alleviate this constraint. Alternatively, point-based shape representations with a compatible deep network architecture, such as [28], may also reduce the memory requirements and increase the output resolution. Second, we made the simplifying assumption that a plausible shape can be assembled from parts using per-part affine transformations, which represent only a subset of possible transformations. While this assumption simplifies training, it is quite restrictive in terms of the deformations we can perform. In future work, we will investigate more general transformations with higher degrees of freedom, such as 3D thin-plate splines or a general deformation field. Finally, we used a cross-entropy loss to measure shape reconstruction quality; it would be interesting to investigate the use of a GAN-type loss in this structure-aware shape generation context.
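For concreteness, the kind of per-part affine placement assumed above can be sketched as an inverse warp of a voxel grid (a sketch with nearest-neighbor sampling, standing in for the trilinear sampling of the in-network 3D spatial transformer; the parameterization as a matrix `A` plus translation `t` is an assumption):

```python
import numpy as np

def affine_warp(volume: np.ndarray, A: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Warp a voxel grid by the affine map x -> A @ x + t, via inverse
    warping: each output voxel pulls its value from its pre-image."""
    D, H, W = volume.shape
    out = np.zeros_like(volume)
    A_inv = np.linalg.inv(A)
    coords = np.indices((D, H, W)).reshape(3, -1).T    # output voxel centers
    src = np.rint((coords - t) @ A_inv.T).astype(int)  # nearest pre-image voxel
    valid = np.all((src >= 0) & (src < [D, H, W]), axis=1)
    out[tuple(coords[valid].T)] = volume[tuple(src[valid].T)]
    return out
```

Replacing the nearest-neighbor rounding with trilinear interpolation makes the sampling differentiable, which is what allows the transformation parameters to be trained end-to-end.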
Acknowledgements
L. Guibas acknowledges NSF grant CHS-1528025, a Vannevar Bush Faculty Fellowship, and gifts from the Adobe and Autodesk Corporations. A. Dubrovina and M. Shalah acknowledge the support in part by The Eric and Wendy Schmidt Postdoctoral Grant for Women in Mathematical and Computing Sciences.
References
 [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [2] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Representation learning and adversarial generation of 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
 [3] J. Barnes, R. Klinger, and S. S. i. Walde. Projecting embeddings for domain adaption: Joint modeling of sentiment analysis in diverse domains. arXiv preprint arXiv:1806.04381, 2018.
 [4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on pattern analysis and machine intelligence, 23(11):1222–1239, 2001.
 [5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
 [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 [7] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pages 628–644. Springer, 2016.
 [8] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.
 [9] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.
 [10] R. Hu, Z. Yan, J. Zhang, O. van Kaick, A. Shamir, H. Zhang, and H. Huang. Predictive and generative neural networks for object functionality. In Computer Graphics Forum (Eurographics State-of-the-art report), volume 37, pages 603–624, 2018.
 [11] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
 [12] E. Kalogerakis, S. Chaudhuri, D. Koller, and V. Koltun. A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics (TOG), 31(4):55, 2012.
 [13] A. Kanazawa, S. Kovalsky, R. Basri, and D. Jacobs. Learning 3d deformation of animals from 2d images. In Computer Graphics Forum, volume 35, pages 365–374. Wiley Online Library, 2016.
 [14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [16] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. Choy, and S. Savarese. DeformNet: Free-form deformation network for 3d shape reconstruction from a single image. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 858–866. IEEE, 2018.
 [17] T. Le and Y. Duan. PointGrid: A deep network for 3d shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9204–9214, 2018.
 [18] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas. GRASS: Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG), 36(4):52, 2017.
 [19] C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey. ST-GAN: Spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9455–9464, 2018.
 [20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
 [21] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 [22] C. Nash and C. K. Williams. The shape variational autoencoder: A deep generative model of part-segmented 3d objects. In Computer Graphics Forum, volume 36, pages 1–12. Wiley Online Library, 2017.
 [23] A. Nguyen, M. Ben-Chen, K. Welnicka, Y. Ye, and L. Guibas. An optimization approach to improving collections of shape maps. In Computer Graphics Forum, volume 30, pages 1481–1491. Wiley Online Library, 2011.
 [24] F. S. Nooruddin and G. Turk. Simplification and repair of polygonal models using volumetric techniques. IEEE Transactions on Visualization and Computer Graphics, 9(2):191–205, 2003.
 [25] C. Poelitz. Projection based transfer learning. In Workshops at ECML, 2014.
 [26] D. V. Poerio and S. D. Brown. Dual-domain calibration transfer using orthogonal projection. Applied spectroscopy, 72(3):378–391, 2018.
 [27] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 818–833, 2018.
 [28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
 [29] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 3, 2017.
 [30] L. K. Senel, I. Utlu, V. Yucesoy, A. Koc, and T. Cukur. Semantic structure and interpretability of word embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
 [31] C.H. Shen, H. Fu, K. Chen, and S.M. Hu. Structure recovery by part assembly. ACM Transactions on Graphics (TOG), 31(6):180, 2012.
 [32] Z. Shu, M. Sahasrabudhe, A. Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance, 2018.
 [33] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5444–5453. IEEE, 2017.
 [34] F. Wang, Q. Huang, and L. J. Guibas. Image cosegmentation via consistent functional maps. In Proceedings of the IEEE International Conference on Computer Vision, pages 849–856, 2013.
 [35] H. Wang, N. Schor, R. Hu, H. Huang, D. Cohen-Or, and H. Huang. Global-to-local generative model for 3d shapes. ACM Transactions on Graphics (Proc. SIGGRAPH ASIA), 37(6):214:1–214:10, 2018.
 [36] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335. Springer, 2016.
 [37] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
 [38] Z. Wu, X. Wang, D. Lin, D. Lischinski, D. Cohen-Or, and H. Huang. Structure-aware generative network for 3d shape modeling. arXiv preprint arXiv:1808.03981, 2018.
 [39] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018.
 [40] K. Xu, H. Zheng, H. Zhang, D. Cohen-Or, L. Liu, and Y. Xiong. Photo-inspired model-driven 3d object modeling. In ACM Transactions on Graphics (TOG), volume 30, page 80. ACM, 2011.
 [41] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.
 [42] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.