SdmNet: Deep Generative Network for Structured Deformable Mesh
Abstract.
We introduce SDMNET, a deep generative neural network which produces structured deformable meshes. Specifically, the network is trained to generate a spatial arrangement of closed, deformable mesh parts, which respects the global part structure of a shape collection, e.g., chairs or airplanes. Our key observation is that while the overall structure of a 3D shape can be complex, the shape can usually be decomposed into a set of parts, each homeomorphic to a box, and the finer-scale geometry of the part can be recovered by deforming the box. The architecture of SDMNET is that of a two-level variational autoencoder (VAE). At the part level, a PartVAE learns a deformable model of part geometries. At the structural level, we train a Structured Parts VAE (SPVAE), which jointly learns the part structure of a shape collection and the part geometries, ensuring the coherence between global shape structure and surface details. Through extensive experiments and comparisons with the state-of-the-art deep generative models of shapes, we demonstrate the superiority of SDMNET in generating meshes with high visual quality, flexible topology, and meaningful structures, benefiting shape interpolation and other subsequent modeling tasks.
1. Introduction
Triangle meshes have been the dominant 3D shape representation in computer graphics, for modeling, rendering, manipulation, and animation. However, as deep learning becomes pervasive in visual computing, most deep convolutional neural networks (CNNs) developed for shape modeling and analysis have resorted to other representations including voxel grids [Girdhar16b; Qi2016; 3dgan2016; Wu_2015_CVPR], shape images [Su2015ICCV; Sinha2016DeepL3], and point clouds [Qi2017cvpr; yin_sig18]. One of the main reasons is that the nonuniformity and irregularity of triangle tessellations do not naturally support conventional convolution and pooling operations [Hanocka2019]. Yet, advantages of meshes over other shape representations should not be overlooked.
Compared to voxels, meshes are more compact and better suited to representing finer surface details. Compared to points, meshes are more controllable and exhibit better visual quality. There have been recent attempts at developing mesh-specific convolutional operators designed for triangle tessellations [Poulenard2018; Hanocka2019]. Current deep generative models for meshes are limited to either genus-zero meshes [Hamu2018; Maron2017TOG] or meshes sharing the same connectivity [Gao2018; meshvae2017]. Patch-based models, which cover a shape with planar [Wang2018ocnn] or curved [AtlasNet2018] patches, are more adaptive, but surface quality is often marred by visible seams, and the patches are otherwise unstructured and incoherent.
In this paper, we introduce a novel deep generative neural network for meshes which overcomes the above limitations. Our key observation is that while the overall structure of a 3D shape can be complex, the shape can usually be decomposed into a set of parts, each homeomorphic to a box, and the finer-scale geometry of the part can be recovered by deforming the box. Hence, the architecture of our network is that of a two-level variational autoencoder (VAE) [kingma2013auto] which produces structured deformable meshes (SDM). At the part level, a PartVAE learns a deformable model of shape parts, by means of autoencoding fixed-connectivity, genus-zero meshes. At the structural level, we train a Structured Parts VAE (SPVAE), which jointly learns the part structure of a shape collection and the part geometries, ensuring the coherence between global shape structure and surface details.
We call our network SDMNET, as it is trained to generate structured deformable meshes, that is, a spatial arrangement of closed, deformable mesh parts, which respects the global part structure (e.g., symmetry and support relations among shape parts) of a shape collection such as chairs or airplanes. Moreover, our network can generate shapes with a varying number of parts, up to a maximum count. Besides the advantages afforded by meshes mentioned above, a structured representation allows shapes generated by SDMNET to be immediately reusable, e.g., for assembly-based modeling [Mitra2013]. In addition, the deformability of the mesh parts further facilitates editing and interpolation of the generated shapes.
SDMNET is trained with a shape collection equipped with a consistent part structure, e.g., semantic segmentation. However, the shapes in the collection may have arbitrary topologies and mesh connectivities. Such datasets are now widely available, e.g., ShapeNet [shapenet] and PartNet [mo2018partnet]. While direct outputs from SDMNET are not watertight meshes, each part is.
In summary, the main contributions of our work are:

The first deep generative neural network which produces structured deformable meshes.

A novel network architecture corresponding to a two-level variational autoencoder which jointly learns shape structure and geometry. This is in contrast to the recent work, GRASS [li_sig17], which learns shape structure and part geometry using separate networks.

A support-based part connection optimization to ensure the generation of plausible and physically valid shapes.
Figure 1 demonstrates the capability of our SDMNET to reconstruct shapes with flexible structure and fine geometric details. By interpolating in the latent space, new plausible shapes with substantial structure change are generated.
Through extensive experiments and comparisons with the state-of-the-art deep generative models of shapes, we demonstrate the superiority of SDMNET in generating quality meshes and shape interpolations. We also show that the structured deformable meshes produced by SDMNET enable other applications such as mesh editing, which are not directly supported by the output from other contemporary deep neural networks.
2. Related Work
With the resurgence of deep neural networks, in particular CNNs, and an increasing availability of 3D shape collections [shapenet], a steady stream of geometric deep learning methods have been developed for discriminative and generative processing of 3D shapes. In this section, we mainly discuss papers mostly related to our work, namely deep generative models of 3D shapes, and group them based on the underlying shape representations.
Voxel grids
The direct extension of pixels in 2D images to 3D is the voxel representation, which has a regular structure convenient for CNNs [Girdhar16b; 3dgan2016; Qi2016]. Variational autoencoders (VAEs) [kingma2013auto] and Generative Adversarial Networks (GANs) [goodfellow2014generative] can be built with this representation to produce new shapes. Wu et al. [pageSAGnet19] utilize an autoencoder of two branches to encode geometry features and structure features separately, and fuse them into a single latent code to intertwine the two types of features for shape modeling. However, these voxel-based representations incur huge memory and computation costs when the volumetric resolution is high. To address this, sparse voxel-based methods use octrees to adaptively represent the geometry. Although such adaptive representations can significantly reduce the memory cost, their expressiveness of geometric details is still limited by the resolution of the leaf nodes of the octrees [Wang2017; Octree2017iccv]. As an improvement, recent work [Wang2018ocnn] utilizes local planar patches to approximate local geometry in leaf nodes. However, planar patches still have limited capability of describing local geometry, especially for complex local shapes. The patches are in general not smooth or connected, and require further processing, which might degrade the quality of generated shapes.
Multiview images
To exploit image-like structures while avoiding the high cost of voxels, projecting shapes to multiple 2D views is a feasible approach. Su et al. [Su2015ICCV] project 3D shapes to multi-view images, along with a novel pooling operation for 3D shape recognition. This representation is regular and efficient. However, it does not contain the full 3D shape information. So, although it can be directly used for recognition, additional efforts and processing are needed to reconstruct 3D shapes [3DVAE]. It also may not fully recover geometric details due to the incomplete information in multi-view images.
Point clouds
Point clouds have been widely used to represent 3D shapes, since they are flexible and can easily represent the raw data obtained from 3D scanners. The major challenge for deep learning on point clouds is their irregular structure. Qi et al. [Qi2017cvpr; Qi2017nips] propose PointNet and PointNet++ for 3D classification and segmentation, utilizing pooling operations that are order independent. Yang et al. [yang2017view] exploit an interactive system for segmenting point clouds of indoor scenes. Fan et al. [fan2016point] use point clouds to reconstruct 3D objects from a given image. Achlioptas et al. [achlioptas18a] introduce a deep autoencoder network for shape representation and generation. However, learning from irregular point clouds is still challenging and their method is only able to produce relatively coarse geometry.
Meshes and multichart representations
Deformable modeling of a shape collection, especially of human models [anguelov2005scape; pons2015dyna], operates on meshes with the same connectivity while altering the mesh vertex positions; the shape collection can be viewed as deformations of a template model. For high quality shape generation, especially with large deformations, a manually crafted deformation representation [Gao2017] is employed by [meshvae2017; Gao2018]. Although these methods can represent and generate shapes with fine details, they require meshes to have the same connectivity. Wang et al. [wang2018pixel2mesh] reconstruct a mesh-based 3D shape from an RGB image by deforming a sphere-like genus-zero mesh model. Dominic et al. [Dominic2018] use a CNN to infer the parameters of free-form deformation (FFD) to deform a template mesh, guided by a target RGB image. Both methods require an image as input to provide guidance, and thus cannot be used for general shape generation tasks without guidance. Moreover, deforming a single mesh limits the topological and geometric complexity of generated shapes.
Multi-chart representations attempt to overcome the above restriction by generating multiple patches that cover a 3D shape. Zhou et al. [Zhou2004] create texture atlases with less stretch for texture mapping. Hamu et al. [Hamu2018] generate a 3D shape as a collection of conformal toric charts [Maron2017TOG], each of which provides a cover of the shape with low distortion. Since toric covers are restricted to genus-zero shapes, their multi-chart method still has the same limitation. AtlasNet [AtlasNet2018] generates a shape as a collection of patches, each of which is parameterized to a 2D domain as an atlas. While the patches together cover the shape well, visible seams can often be observed. In general, neither the atlases nor the toric charts correspond to meaningful shape parts; the collection is optimized to approximate a shape, but is otherwise unstructured. In contrast, SDMNET produces structured deformable meshes.
Implicit representations
Several very recent works [chen2019IMNET; park2019DeepSDF; mescheder2019Occupancy] show great promise of generative shape modeling using implicit representations. These deep networks learn an implicit function which defines the inside/outside statuses of points with respect to a shape or a signed distance function. The generative models can be applied to various applications including shape autoencoding, generation, interpolation, completion, and single-view reconstruction, demonstrating superior visual quality over methods based on voxels, point clouds, as well as patch-based representations. However, none of these works generate structured or deformable shapes.
Shape structures
Man-made shapes are highly structured, which motivates structure-aware shape processing [Mitra2013]. Works on learning generative models of 3D shape structures can be roughly divided into two categories [chaudhuri2019]: probabilistic graphical models and deep neural networks.
Huang et al. [Huang2015CGF] propose a probabilistic model which computes part templates, shape correspondences, and segmentations from clustered shape collections, and their points in each part are influenced by their correspondence in the template. Similar to [Huang2015CGF], ShapeVAE [nash2017shape] generates point coordinates and normals based on different parts, but uses a deep neural network instead of a probabilistic model. Compared to the above two works, our method does not require pointwise correspondences, which can be difficult or expensive to obtain reliably. Moreover, our method encodes both global spatial structure like support relations, and local geometric deformation, producing shapes with reasonable structures and fine details.
Li et al. [li_sig17] introduce GRASS, a generative recursive autoencoder for shape structures, based on Recursive Neural Networks (RvNNs). Like SDMNET, GRASS also decouples structure and geometry representations. However, a key difference is that SDMNET jointly encodes global shape structure and part geometry, while GRASS trains independent networks for structure and part geometry generations. In terms of outputs, GRASS generates a hierarchical organization of bounding boxes, and then fills them with voxel parts. SDMNET produces a set of shape parts, each of which is a deformable mesh to better capture finer surface details. Lastly, the structural autoencoder of GRASS requires symmetry hierarchies for training while SDMNET only requires consistent semantic segmentation and employs the support information to produce shapes with support stability.
Concurrent to SDMNET, Mo et al. [mo2019structurenet] develop StructureNet, which learns a generative autoencoder of shape structures based on graph neural networks. StructureNet shares much commonality with GRASS but extends it in two important ways. First, unlike GRASS, which is limited to encoding binary trees, StructureNet can directly encode shapes represented as $n$-ary graphs, aimed to facilitate a consistent hierarchical representation of shapes within the same category. Second, StructureNet also accounts for horizontal inter-part relationships between siblings. The outputs from StructureNet are either box structures or point cloud shapes. Our SDMNET analyzes and encodes shape structures not only using the consistent representation across the same shape families but also with support stability. In addition, we expect our mesh-based representation with deformable parts to capture more geometric detail than the box and point cloud based representations adopted by StructureNet.
3. Methodology
Overview. Given a collection of shapes of the same category with part-level labels, our method represents them using a structured set of deformable boxes, each corresponding to a part. The pioneering works [Ovsjanikov2011; Kim2013TOG] have shown the representation power of using a collection of boxes to analyze and explore shape collections. However, it is highly non-trivial to extend their techniques to shape generation, since boxes are generally a coarse representation. We tackle this challenge by allowing individual boxes to be flexibly deformable and propose a two-level VAE architecture called SDMNET, including PartVAE for encoding the geometry of deformable boxes, and SPVAE for joint encoding of part geometry and global structure such as symmetry and support. Moreover, to ensure that decoded shapes are physically plausible and stable, we introduce an optimization based on multiple constraints including support stability, which can be compactly formulated and efficiently optimized. Our SDMNET model allows easy generation of plausible meshes with flexible structures and fine details.
We first introduce the encoding of each part, including both the geometry and structure information. We then introduce our SDMNET involving VAEs at both the local part geometry level (PartVAE), and global joint embedding of structure and geometry (SPVAE). Then we briefly describe how the support relationships are extracted, and finally present our optimization for generating plausible and well supported shapes.
3.1. Encoding of a Shape Part
Based on semantic part-level labels, a shape is decomposed into a set of parts. Each part is represented using a deformable bounding box, as illustrated in Figure 3. Let $n$ be the total number of part labels that appear across different shapes in the specified object category. A given shape may contain fewer parts, as some parts may not be present. To make analysis and relationship representation easier, we assume the initial bounding box (before deformation) of each part is axis aligned. This is sufficient in practice, since each bounding box is allowed to have substantial deformation to fit the target geometry. The initial bounding primitive being a box does not prevent the internal part geometry from being complex, since geometric details can be captured and preserved through non-rigid registration (see Section 3.2 for details). Without loss of generality, bounding boxes are used in our framework.
The geometry and associated relationships of each part are encoded by a representation vector $rv$, as illustrated in Figure 2. The detailed definition of this vector is given as follows.

$rv_1 \in \{0, 1\}$ indicates the existence of this part.

$rv_2 \in \{0, 1\}^n$ is a vector with $n$ dimensions (one per part label) to indicate which parts are supported by this part.

$rv_3 \in \{0, 1\}^n$ is a vector with $n$ dimensions to indicate which parts support the current part.

$rv_4 \in \mathbb{R}^3$ is the 3D position of the bounding box center.

$rv_5 \in \{0, 1\}$ indicates the existence of a symmetric part.

$rv_6 \in \mathbb{R}^4$ records the parameters $a, b, c, d$ of the symmetry plane represented in an implicit form, i.e., $ax + by + cz + d = 0$.

$rv_7$ is the encoded vector from the PartVAE described in Section 3.2, which encodes its geometry. By default, $rv_7 \in \mathbb{R}^{64}$.
The ID of each part, used in $rv_2$ and $rv_3$, is predetermined and stored in advance for the dataset. Each value in $rv_1$, $rv_2$, $rv_3$ and $rv_5$ is 1 if the corresponding item exists and 0 otherwise. For generated vectors, we treat a value above 0.5 as true and below as false. The length of this vector is $2n + 73$, which is between 77 and 101 for all the examples in this paper. Note that other information such as the label of the part that is symmetric to the current one (if it exists) is fixed for a collection (e.g. the right armrest of a chair is symmetric to the left armrest of the chair) and therefore not encoded in the vector. In our current implementation, we only consider reflection symmetry and keep one symmetric component (if any) for each part. Although this is somewhat restrictive, it is very common and sufficient to cope with most cases. In practice, we first perform global reflection symmetry detection [podolak2006planar] to identify components that are symmetric to each other w.r.t. a symmetry plane. This is then supplemented by local reflection symmetry detection, which checks whether pairs of parts have reflective symmetry.
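As a rough illustration of this layout, the following Python sketch packs the seven fields into a single vector and checks its length against the 77 to 101 range quoted above. The function and argument names are hypothetical, the 0.5 threshold used when decoding flags is an assumption of this sketch, and NumPy is assumed to be available; the field order simply follows the list.

```python
import numpy as np

LATENT_DIM = 64  # PartVAE code length stated in the text


def build_part_vector(n, exists, supports, supported_by, center,
                      has_symmetry, plane, latent):
    """Pack one part's fields into a vector of length 2n + 73.

    All names are illustrative; `supports` and `supported_by`
    are lists of part IDs in the range [0, n).
    """
    supports_mask = np.zeros(n)
    supports_mask[np.asarray(list(supports), dtype=int)] = 1.0
    supported_mask = np.zeros(n)
    supported_mask[np.asarray(list(supported_by), dtype=int)] = 1.0
    vec = np.concatenate([
        [float(exists)],              # existence flag (1)
        supports_mask,                # parts supported by this part (n)
        supported_mask,               # parts supporting this part (n)
        np.asarray(center, float),    # bounding box center (3)
        [float(has_symmetry)],        # symmetric part exists (1)
        np.asarray(plane, float),     # symmetry plane a, b, c, d (4)
        np.asarray(latent, float),    # PartVAE code (64)
    ])
    assert vec.shape == (2 * n + 9 + LATENT_DIM,)
    return vec


def decode_flags(vec, n, threshold=0.5):
    """Binarize the flag entries of a generated vector (threshold assumed)."""
    exists = bool(vec[0] > threshold)
    supports = np.flatnonzero(vec[1:1 + n] > threshold).tolist()
    supported_by = np.flatnonzero(vec[1 + n:1 + 2 * n] > threshold).tolist()
    has_symmetry = bool(vec[1 + 2 * n + 3] > threshold)
    return exists, supports, supported_by, has_symmetry
```

With 2 part labels the vector has 77 entries and with 14 labels it has 101, which matches the smallest and largest label counts in the dataset statistics.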
3.2. PartVAE for Encoding Part Geometry
For each part, the axis-aligned bounding box (AABB) is first calculated. The bounding box of the same part type provides a uniform domain across different shapes, and the geometry variations are viewed as different deformation functions applied to the same domain. We take a common template, namely a triangulated unit cube mesh $B_0$, to represent each part. We first translate and scale it to fit the bounding box of the part. Denote by $B_{i,j}$ the bounding box transformed from $B_0$ for the $j$-th part on the $i$-th shape $s_i$. We treat it as initialization, and apply non-rigid coarse-to-fine registration [Zollhofer2014], which deforms $B_{i,j}$ to $B'_{i,j}$, as illustrated in Figure 3 (d). $B'_{i,j}$ shares the geometry details with the part and has the same mesh connectivity as the unit cube box $B_0$.
The variational autoencoder has been used to encode the geometric priors for point cloud segmentation [meng2018vv] and mesh generation [meshvae2017]. Similar to [meshvae2017], using meshes of the same connectivity makes it feasible to build a variational autoencoder to represent the deformation of the template box. The convolutional VAE architecture in [Gao2018] is employed for compactly representing plausible deformation of each part, allowing new variations to be synthesized. The architecture is shown in Figure 4. The input is a $V \times 9$ dimensional matrix, where $V$ is the number of vertices of the template bounding box mesh. Each row of the matrix is a 9-dimensional vector that characterizes the local deformation of the 1-ring neighborhood of each vertex, including the rotation axis, rotation angle and scaling factor. It passes through two convolutional layers followed by a fully connected layer to obtain the mean and variance. The decoder mirrors the structure of the encoder to recover the deformation representation, but with different trainable weights. Since each part type has its own characteristics, we train one PartVAE per part type, using all the parts of that type across different shapes.
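The initialization step, translating and scaling the template cube onto the part's bounding box, can be sketched as follows. This is a minimal illustration, assuming the template spans the unit cube from 0 to 1 on each axis and that NumPy is available; the function name is hypothetical, and the subsequent non-rigid registration is not shown.

```python
import numpy as np


def fit_template_to_aabb(template_vertices, part_vertices):
    """Scale and translate a unit-cube template onto a part's AABB.

    Assumes the template spans [0, 1]^3; both inputs are (k, 3) arrays.
    """
    template_vertices = np.asarray(template_vertices, float)
    part_vertices = np.asarray(part_vertices, float)
    lower = part_vertices.min(axis=0)   # AABB minimum corner
    upper = part_vertices.max(axis=0)   # AABB maximum corner
    return lower + template_vertices * (upper - lower)
```

Because only vertex positions change, the fitted box keeps the connectivity of the template, which is what allows all parts of one type to share a common domain.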
3.3. Supporting Structure Analysis
Structure captures the relationships between parts, and proper encoding of shape structure is crucial to generate plausible shapes. Symmetry, as one of the structural relationships, has been well explored, for example in GRASS [li_sig17]. Besides symmetry, support relationships have been shown to be useful for structural analysis to synthesize physically plausible new structures with support and stability [Huang2016TVCG]. Compared to symmetry, support relationships provide a more direct mechanism to model the relations between adjacent parts. We thus use both symmetry and support to encode the structural relationships between parts (Section 3.1). Note that our work is the first attempt to encode support-based shape structures in deep neural networks.
Following [Huang2016TVCG], we detect the support relations between adjacent parts as one of three support substructures, namely, “support from below”, “support from above”, and “support from side”. As illustrated in Figure 5, the detected support relations turn an undirected adjacency graph to a directed support graph. For each detected support relation of a part, we encode the labels of its supported and supporting parts in our part feature vector (Section 3.1). Our feature coding is flexible to represent the cases including one part being supported by multiple parts, as well as multiple parts being supported by one part. Since for all the cases in our shape dataset, the substructure type for each support relation between two adjacent parts is fixed, the support substructure types are kept in a lookup table but not encoded in our part feature vector. Given the supporting and supported part labels from a decoded part feature vector, we can efficiently obtain the corresponding substructures from this lookup table.
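To make the encoding concrete, the sketch below turns a directed support graph into the per-part "supports" and "supported by" indicator lists, and keeps the substructure types in a separate lookup table. The part labels, the edge list and the table contents are hypothetical examples, not values from the dataset.

```python
def support_vectors(n, support_edges):
    """Build per-part 'supports' and 'supported by' indicator lists.

    `support_edges` holds directed pairs (supporter_id, supported_id).
    """
    supports = [[0] * n for _ in range(n)]
    supported_by = [[0] * n for _ in range(n)]
    for supporter, supported in support_edges:
        supports[supporter][supported] = 1
        supported_by[supported][supporter] = 1
    return supports, supported_by


# Hypothetical chair labels: 0 = seat, 1 = leg, 2 = back.
edges = [(1, 0), (0, 2)]
# Hypothetical lookup table of substructure types, stored outside the vectors.
substructure = {(1, 0): "support from below", (0, 2): "support from below"}
supports, supported_by = support_vectors(3, edges)
```

One part supporting several parts, or being supported by several parts, simply sets several entries in the corresponding list.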
3.4. SPVAE for Structured Deformable Mesh Encoding
We build SPVAE to jointly encode the structure of a shape represented as the layout of boxes, and the geometry of its parts. By analyzing their joint distribution, it helps ensure that the geometry of the generated shape is coherent with the structure and the geometries of individual parts are consistent (i.e., of compatible styles). Our SPVAE takes the concatenation of representation vectors for all the parts as input (see Section 3.1). It encodes parts in a consistent order during encoding and decoding. This concatenated vector covers both the geometric details of individual parts encoded using PartVAE, and the relationships between them. The SPVAE uses multiple fully connected layers, and the architecture is illustrated in Figure 6.
Let $Enc_S(\cdot)$ and $Dec_S(\cdot)$ denote the encoder and decoder of our SPVAE network, respectively. $x$ represents the input concatenated feature vector of a shape, $\tilde{x} = Enc_S(x)$ is the encoded latent vector, and $x' = Dec_S(\tilde{x})$ is the reconstructed feature vector. Our SPVAE minimizes the following loss:
$$L_{SPVAE} = \lambda_1 L_{recon} + \lambda_2 L_{KL} + L_{Reg}, \tag{1}$$
where $\lambda_1$ and $\lambda_2$ are the weights of different loss terms, and
$$L_{recon} = \frac{1}{N} \sum_{x \in S} \| x - x' \|_2^2 \tag{2}$$
denotes the MSE (mean squared error) reconstruction loss to ensure better reconstruction. Here $S$ is the training dataset and $N$ is the number of shapes in the training set.
$$L_{KL} = D_{KL}\left( q(\tilde{x} \mid x) \,\|\, p(\tilde{x}) \right) \tag{3}$$
is the KL divergence to promote Gaussian distribution in the latent space, where $q(\tilde{x} \mid x)$ is the posterior distribution given feature vector $x$, and $p(\tilde{x})$ is the Gaussian prior distribution. $L_{Reg}$ is the squared $\ell_2$ norm regularization term of the network parameters used to avoid overfitting. The Gaussian distribution makes it effective to generate new shapes by sampling in the latent space, which is used for random generation and interpolation.
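The loss described here, a weighted sum of an MSE reconstruction term, a KL divergence term and a squared-norm regularizer, can be sketched in NumPy as below. Several details are assumptions of this sketch rather than statements from the text: the posterior is taken to be a diagonal Gaussian and the prior a standard normal (so the KL term has a closed form), the regularizer is given unit weight, and the default weights 1.0 and 0.5 are read from the best column of the hyperparameter table, assuming the first entry of each pair is the reconstruction weight.

```python
import numpy as np


def spvae_loss(x, x_rec, mean, log_var, weights, lam1=1.0, lam2=0.5):
    """Weighted sum of reconstruction, KL and weight regularization terms.

    x, x_rec: (N, D) inputs and reconstructions.
    mean, log_var: (N, Z) parameters of the diagonal Gaussian posterior.
    weights: list of parameter arrays used for the squared-norm penalty.
    """
    # Mean over shapes of the squared reconstruction error.
    recon = np.mean(np.sum((x - x_rec) ** 2, axis=1))
    # Closed-form KL between N(mean, diag(exp(log_var))) and N(0, I),
    # averaged over the batch (an assumption of this sketch).
    kl = np.mean(0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var, axis=1))
    # Squared norm of all network parameters.
    reg = sum(float(np.sum(w ** 2)) for w in weights)
    return lam1 * recon + lam2 * kl + reg
```

When the posterior equals the prior (zero mean, zero log-variance) the KL term vanishes, so the loss reduces to reconstruction plus regularization.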
3.5. Shape Generation and Refinement
The latent space of SPVAE provides a meaningful space for shape generation and interpolation. Extensive experimental results are shown in Section 5. Random sampling in the latent space can generate novel shapes. However, although the desired geometry and structure from the decoded feature vector are generally reasonable, they may not satisfy supporting and physical constraints exactly, resulting in shapes which may include parts not exactly in contact, or may be unstable. Inspired by [Averkiou2014EG], we propose to use an effective global optimization to refine the spatial relations between parts by mainly using the associated symmetry and support information.
Denote the center position and size (half of the length in each dimension) of the $i$-th part as $p_i$ and $q_i$, each being a 3-dimensional vector corresponding to the $x$, $y$ and $z$ axes, where $p_i$ is directly obtained from the representation vector, and $q_i$ is determined by the bounding box after recovering the part geometry. Denote by $p'_i$ and $q'_i$ the position and size of the $i$-th part after global optimization. The objective of this optimization is to minimize the changes between the optimized position/scale and the original position/scale
$$\sum_i \| p'_i - p_i \|^2 + \alpha \| q'_i - q_i \|^2 \tag{4}$$
while ensuring the following constraints are satisfied. $\alpha$ is a weight to balance the two terms, and is fixed to a constant value in our experiments. The symmetry and equal length constraints are from [Averkiou2014EG], though we use the support relationships to help identify equal length constraints more reliably. The remaining constraints are unique to our approach. Figure 7 illustrates typical problematic cases which are handled by our refinement optimization.
Symmetry Constraint: If the generated $i$-th part has the symmetry indicator flagged in its representation vector, its symmetric part (denoted using index $j$) also exists. Let $\mathbf{n}_i$ and $d_i$ denote the normal and the intercept of the symmetry plane, respectively. Enforcing the symmetry constraint requires the midpoint of the two part centers to lie on the symmetry plane and the line connecting them to be parallel to the plane normal: $\frac{p'_i + p'_j}{2} \cdot \mathbf{n}_i + d_i = 0$, $(p'_i - p'_j) \times \mathbf{n}_i = 0$. The symmetry of two parts is viewed as an undirected relationship. If the symmetry indicator ($rv_5$ of the representation vector) of either part $i$ or $j$ is 1, we consider these two parts as symmetric.
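The geometric meaning of reflection symmetry between two part centers can be checked with a few lines of NumPy. This is only an illustrative sketch: the function names and the tolerance are hypothetical, and the plane is written as normal . x + intercept = 0 to match the implicit form used for the symmetry plane.

```python
import numpy as np


def reflect(point, normal, intercept):
    """Mirror a point across the plane normal . x + intercept = 0."""
    point = np.asarray(point, float)
    normal = np.asarray(normal, float)
    signed = (point @ normal + intercept) / (normal @ normal)
    return point - 2.0 * signed * normal


def is_symmetric_pair(p_i, p_j, normal, intercept, tol=1e-9):
    """Midpoint must lie on the plane and p_i - p_j must be along the normal."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    normal = np.asarray(normal, float)
    midpoint_on_plane = abs(((p_i + p_j) / 2.0) @ normal + intercept) <= tol
    parallel = np.linalg.norm(np.cross(p_i - p_j, normal)) <= tol
    return bool(midpoint_on_plane and parallel)
```

A center and its mirror image always pass the check, while shifting one of them within the plane breaks the parallelism condition.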
Equal Length Constraint: A set of parts which are simultaneously supported by a common part, and simultaneously support another common part, are considered as a group to have the same length along the supporting direction. For this purpose, the ground is considered as a virtual part. For example, the four legs of a table supporting the same table top part and at the same time being supported by the ground should have the same height. These can be easily detected by traversing the support structure. An example that violates this constraint is illustrated in Figure 7 (a). The equal length constraints can be formulated as $q'_k[t] = q'_i[t], \forall k \in g_i$, where $g_i$ is an equal-length group containing the $i$-th part, and $t$ is the supporting direction. $t \in \{0, 1, 2\}$ respectively represents the $x$, $y$ and $z$ directions, with one axis taken as the upright direction.
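Detecting such groups by traversing the support structure might look like the sketch below. It makes simplifying assumptions: the ground is given the hypothetical ID -1 and its support edges are passed in explicitly, and parts are grouped only when their sets of supporting and supported parts are identical, which is stricter than sharing one common part. The helper that averages the lengths gives the least-squares solution when the equal length constraint is considered in isolation.

```python
from collections import defaultdict

GROUND = -1  # hypothetical ID for the virtual ground part


def equal_length_groups(support_edges):
    """Group parts that share the same supporters and the same supported parts.

    `support_edges` holds directed pairs (supporter_id, supported_id).
    """
    below = defaultdict(set)  # part -> parts supporting it
    above = defaultdict(set)  # part -> parts it supports
    for supporter, supported in support_edges:
        above[supporter].add(supported)
        below[supported].add(supporter)
    buckets = defaultdict(list)
    for part in set(above) | set(below):
        if part == GROUND:
            continue
        if above[part] and below[part]:
            key = (frozenset(below[part]), frozenset(above[part]))
            buckets[key].append(part)
    return [sorted(group) for group in buckets.values() if len(group) > 1]


def equalize(half_lengths, group):
    """Set each member's half-length to the group mean."""
    mean = sum(half_lengths[k] for k in group) / len(group)
    return {**half_lengths, **{k: mean for k in group}}
```

For a table whose four legs (IDs 1 to 4) stand on the ground and carry the top (ID 0), the legs form a single group and the top is left untouched.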
Support Relationship Constraint: In order for a supporting part to properly support a supported part, two requirements must be met. 1) In the supporting direction, the bounding box of the supporting part should be tangent to (or, in practice, slightly intersect) the bounding box of the part being supported (see Figure 7 (e) for a problematic case violating this constraint). If the $i$-th part supports the $j$-th part, the following inequality should be satisfied along the supporting direction: $p'_j[t] - q'_j[t] \le p'_i[t] + q'_i[t] \le p'_j[t] - (1 - 2\varepsilon) q'_j[t]$, where $t$ is the supporting direction, and $\varepsilon$ controls the amount of overlap allowed and is fixed to a constant value in our experiments. 2) Assuming $b_i$ and $b_j$ are the bounding boxes of parts $i$ and $j$ projected onto the plane orthogonal to the supporting direction $t$, it should hold that either $b_i \subseteq b_j$ or $b_j \subseteq b_i$ (see Figure 7 (c, d) for examples). This constraint can be formulated as an integer programming problem and solved efficiently during the optimization. The detailed formulation is provided in the Appendix.
Stable Support Constraint: For the “support above” relation, the center of a supported part should be located in the supporting bounding box (the bounding box that covers all the bounding boxes of the supporting parts) for stable support (see Figure 7 (b) for an example violating this constraint). For a single supporting part $j$ and supported part $i$, the following constraint should hold in each of the two horizontal directions $t$: $p'_j[t] - q'_j[t] \le p'_i[t] \le p'_j[t] + q'_j[t]$. For multiple supporting parts (e.g. four legs supporting the table top), the lower and upper bounds in the two horizontal directions are chosen from the corresponding parts.
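A simple check of the stable support condition, written as a hedged sketch: the supporting boxes are merged into one covering box and the supported center is tested against it in the two horizontal directions. The function name is hypothetical, and the choice of axis 2 as the upright direction is an assumption of this example that can be changed through the `up_axis` argument.

```python
import numpy as np


def is_stably_supported(center, supporters, up_axis=2):
    """Check that `center` lies over the union box of the supporting parts.

    supporters: list of (center, half_size) pairs for the supporting parts.
    up_axis: index of the upright axis (an assumption of this sketch).
    """
    lows = np.array([np.asarray(c, float) - np.asarray(h, float)
                     for c, h in supporters])
    highs = np.array([np.asarray(c, float) + np.asarray(h, float)
                      for c, h in supporters])
    lower, upper = lows.min(axis=0), highs.max(axis=0)
    horizontal = [t for t in range(3) if t != up_axis]
    return all(lower[t] <= center[t] <= upper[t] for t in horizontal)
```

With four table legs as supporters, a table top centered between the legs passes, whereas a top whose center is moved outside the footprint of the legs fails.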
This quadratic optimization with linear integer programming is solved by TOMLAB [holmstrom2004tomlab] efficiently. We show an example of shape refinement in Figure 8.
4. Dataset and Network Implementation
We now give the details of our network architecture and training process. The experiments were carried out on a computer with an i7 6850K CPU, 64GB RAM, and a GTX 1080Ti GPU.
4.1. Dataset Preparation
The mesh models used in our paper are from [Yi16], including a subset of ShapeNet Core V2 models [shapenet], as well as ModelNet [Wu_2015_CVPR]. These datasets include prealigned models. However, ModelNet does not contain semantic segmentation, and models from [Yi16] sometimes do not have sufficiently detailed segmentation to describe support structure (e.g. the car body and four wheels are treated as a single segment). To facilitate our processing, we use an active learning approach [Yi16] to perform a refined semantic segmentation. We further use refined labels to represent individual parts of the same type, e.g., to have left armrest and right armrest labels for two armrest parts. The statistics of the resulting dataset are shown in Table 1.
Our network takes 3D shapes with consistent segmentation as input. The segmentation of test shapes can be obtained by some supervised methods such as [Qi2017cvpr; Qi2017nips]. Each segmented part is registered from the bounding box by nonrigid deformation. Our method allows each part type to include substantial geometric variations, e.g., the swivel leg and bar stool in Figure 14. Within the confines of the consistent segmentation, the SPVAE is capable of handling variations of part structure and topologies, as well as varying part counts (by marking certain parts as nonexisting).
Category  Airplane  Car  Chair  Table  Mug  Monitor  Guitar 
# Meshes  2690  1824  3746  5266  213  465  787 
# Labels  14  7  10  9  2  3  3 
4.2. Network Architecture
The whole network includes two components, namely PartVAE for encoding the deformation of each part of a shape, and SPVAE for jointly encoding the global structure of the shape and the geometric details of each part.
As illustrated in Figure 4, the encoder of the PartVAE has two convolutional layers and one fully connected layer. The convolutional layers use a nonlinear activation function, except the last convolution layer, which uses a linear output. The output of the last convolution layer is reshaped to a vector and mapped into a 64-dimensional latent space by the fully connected layer. The decoder has a mirrored structure, but does not share weights with the encoder. We train the PartVAE once for each part type.
The input of the SPVAE is the concatenated representation vector of all parts as shown in Figure 6. The input is fully connected with dimensions 1024, 512 and 256, respectively, and the latent space dimension is 128. Leaky ReLU is set as the activation function.
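The encoder side of this layout can be sketched in plain NumPy with random weights, purely to show the tensor shapes. The leaky ReLU slope of 0.2, the separate mean and log-variance heads, the weight initialization and all function names are assumptions of this sketch; the layer widths (1024, 512, 256) and the latent size (128) come from the text.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    # Slope value is an assumption; the text only names the activation.
    return np.where(x > 0, x, slope * x)


def init_encoder(input_dim, hidden=(1024, 512, 256), latent=128, seed=0):
    """Random weights for the fully connected encoder (illustration only)."""
    rng = np.random.default_rng(seed)
    dims = (input_dim, *hidden)
    layers = [(rng.normal(0.0, 0.01, (a, b)), np.zeros(b))
              for a, b in zip(dims[:-1], dims[1:])]
    # Two output heads: index 0 for the mean, index 1 for the log-variance.
    heads = [(rng.normal(0.0, 0.01, (hidden[-1], latent)), np.zeros(latent))
             for _ in range(2)]
    return layers, heads


def encode(x, layers, heads, rng):
    h = x
    for weight, bias in layers:
        h = leaky_relu(h @ weight + bias)
    mean = h @ heads[0][0] + heads[0][1]
    log_var = h @ heads[1][0] + heads[1][1]
    # Reparameterization: sample z = mean + sigma * eps.
    z = mean + np.exp(0.5 * log_var) * rng.standard_normal(mean.shape)
    return z, mean, log_var
```

For a category with 10 part labels, each part vector has 2 * 10 + 73 = 93 entries, so the concatenated input has 930 entries.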
4.3. Parameters
We use fixed hyperparameters in our experiments for different shape categories. In the following, we perform experiments on the table data in ShapeNet Core V2 to demonstrate how the method behaves with changing hyperparameters. The dataset is randomly split into training data (75%) and test data (25%). The generalization of SPVAE is evaluated with different hyperparameters in Table 2, where the bidirectional Chamfer distance is used to measure the reconstruction error on the test data (as unseen data). We repeat each test 10 times and report the average errors in Table 2. As can be seen, SPVAE has the lowest error (1.85) with the setting (1.0, 0.5), where the two values are the weights of the reconstruction error term and the KL divergence term, respectively. The hyperparameters (weights of reconstruction, KL divergence, and regularization) of PartVAE are set to the same values as in [Gao2018]. We set the dimension of the latent space of PartVAE to 64, and the dimension of the latent space of SPVAE to 128. These two parameters are evaluated in Tables 3 and 4 with the reconstruction error. When adjusting the dimension of one VAE, we leave the dimension of the other VAE unchanged.

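Table 2. Reconstruction error of SPVAE on the test data with different hyperparameters.
Hyperparameters  (0.5, 0.5)  (1.0, 0.5)  (0.5, 1.0)  (1.0, 1.0)
Recons. Error ()  2.01  1.85  2.24  1.94

Table 3. Reconstruction error with different dimensions of the PartVAE latent space.
Dimension  32  64  128  256
1.92  1.76  1.74  1.82
2.16  1.85  1.91  2.03

Table 4. Reconstruction error with different dimensions of the SPVAE latent space.
Dimension  32  64  128  256
2.23  1.99  1.85  1.91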
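To make the role of the two weights concrete, the following is a minimal sketch of a weighted VAE objective. It assumes a squared-error reconstruction term and a diagonal Gaussian posterior compared against a standard normal prior; the function names are hypothetical, and any additional terms of the actual SPVAE loss are omitted. The default weights (1.0, 0.5) are the best-performing pair in Table 2.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)), one value per sample.
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def weighted_vae_loss(x, x_rec, mu, logvar, w_recon=1.0, w_kl=0.5):
    # Squared-error reconstruction is an assumption made for this sketch.
    recon = np.mean(np.sum((x - x_rec) ** 2, axis=-1))
    kl = np.mean(kl_to_standard_normal(mu, logvar))
    return w_recon * recon + w_kl * kl

x = np.zeros((2, 3))
# Perfect reconstruction, unit mean, unit variance: only the KL term remains.
loss = weighted_vae_loss(x, x, mu=np.ones((2, 2)), logvar=np.zeros((2, 2)))
print(loss)  # 0.5
```

Raising the KL weight pulls the latent codes toward the prior at the cost of reconstruction accuracy, which is consistent with the higher test errors in the (0.5, 1.0) and (1.0, 1.0) columns of Table 2.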
4.4. Training Details
Since a PartVAE encodes the geometry of a specific type of part, it is trained separately. SPVAE is then trained using the trained PartVAEs for encoding part geometry. Both VAEs are optimized using the Adam solver [adamsolver]. The PartVAE is trained for 20,000 iterations and SPVAE for 120,000 iterations by minimizing their loss functions. For both VAEs, we set the batch size to 512 and the learning rate to start from 0.001 and decay every 1000 steps with a decay rate of 0.8. Each training batch is randomly sampled from the training data set.
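The learning rate schedule can be written out as a short function. This sketch assumes a staircase decay, i.e., the rate is multiplied by 0.8 once every 1000 steps and held constant in between; a smooth exponential decay would also match the description, and the function name is hypothetical.

```python
def learning_rate(step, base_lr=0.001, decay_rate=0.8, decay_steps=1000):
    # Staircase exponential decay (assumed): one multiplication per 1000 steps.
    return base_lr * decay_rate ** (step // decay_steps)

for step in (0, 999, 1000, 2500, 20000):
    print(step, learning_rate(step))
```

Under this assumption the rate stays at 0.001 for the first 1000 steps, drops to 0.0008 at step 1000, and has decayed by a factor of 0.8^20 (roughly 0.0115) by the end of the 20,000 PartVAE iterations.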
For a typical category, the training of both PartVAE and SPVAE takes about 300 minutes. Once the networks are trained, shape generation is very efficient: generating one shape and structure optimization take only about 36 and 100 milliseconds, respectively.
5. Results and Evaluation
We present the results of shape reconstruction, shape generation and shape interpolation to demonstrate the capability of our method, and compare them with those generated by state-of-the-art methods. We also perform ablation studies to show the advantages of our design. Finally, we present examples to show generalizability (i.e., applying our learned model to new shapes of the same category), editability and limitations of our technique.
Shape Reconstruction. We compare our method with PSG [fan2016point], AtlasNet [AtlasNet2018] and Adaptive O-CNN [Wang2018ocnn] on the ShapeNet Core V2 dataset. In this experiment, we choose four representative categories commonly used in the literature to perform both qualitative and quantitative comparisons. Each dataset is randomly split into the training set () and test set (). For fair comparison, we train PSG, AtlasNet, and Adaptive O-CNN for individual shape categories, as we do for our method. To prepare the input for PSG, we use rendered images under different viewpoints. Given the same input models in the test set, we compare the shapes decoded by the different methods. Figures 9 and 10 show the visual comparison of representative results on several test shapes. The shapes decoded by PSG, Adaptive O-CNN and AtlasNet do not capture the input shapes faithfully. AtlasNet and Adaptive O-CNN are able to produce more details than PSG, but suffer from clearly noticeable patch artifacts. In contrast, SDMNET recovers shapes with higher quality and finer-detailed geometry. Note that we compare the existing methods with SPVAE followed by structure optimization instead of SPVAE alone, since structure optimization, which depends on the output of SPVAE, is a unique and essential component of our system, and cannot be directly used with the methods being compared due to their lack of structure information.
Moreover, we quantitatively compare our method with the existing methods using common metrics for 3D shape sets, including Jensen-Shannon Divergence (JSD), Coverage (COV) and Minimum Matching Distance (MMD) [achlioptas18a]. The latter two metrics are calculated using both the Chamfer Distance (CD) and Earth Mover’s Distance (EMD) for measuring the distance between shapes. For JSD and MMD, the smaller the better, while for COV, the larger the better. The average results for different methods on these datasets are shown in Table 5. It can be seen that our method achieves the best performance for nearly all the metrics.
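As an illustration of how the set-level metrics are computed, the sketch below implements a bidirectional Chamfer distance together with MMD and COV built on top of it. It follows the usual definitions: MMD averages, over the reference shapes, the distance to the closest generated shape, and COV is the fraction of reference shapes that are the nearest neighbor of at least one generated shape. The particular Chamfer variant (sum of the two mean squared nearest-neighbor distances) and all function names are assumptions; JSD and EMD are not covered here.

```python
import numpy as np

def chamfer(p, q):
    # p: (n, 3) and q: (m, 3) point sets; squared distances between all pairs.
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Nearest neighbor in each direction, averaged, then summed.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def pairwise_distances(generated, reference, dist=chamfer):
    # Rows index generated shapes, columns index reference shapes.
    return np.array([[dist(g, r) for r in reference] for g in generated])

def mmd(d):
    # For every reference shape, the distance to its closest generated shape.
    return d.min(axis=0).mean()

def cov(d):
    # Fraction of reference shapes matched by at least one generated shape.
    return len(np.unique(d.argmin(axis=1))) / d.shape[1]

reference = [np.array([[0.0, 0.0, 0.0]]), np.array([[10.0, 0.0, 0.0]])]
generated = [np.array([[0.0, 0.0, 0.0]]), np.array([[1.0, 0.0, 0.0]])]
d = pairwise_distances(generated, reference)
print(mmd(d), cov(d))  # 81.0 0.5
```

In this toy case both generated shapes are closest to the first reference shape, so only half of the reference set is covered, and the second reference shape contributes a large matching distance to MMD.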
Table 5. Quantitative comparison of shape reconstruction with different methods. For JSD and MMD, smaller is better; for COV, larger is better.
Dataset  Method  JSD  MMD-CD  MMD-EMD  COV-CD  COV-EMD
Airplane  AOCNN  0.0665  0.0167  0.0157  84.3  95.5
Airplane  AtlasNet  0.0379  0.0147  0.0132  79.6  82.1
Airplane  PSG  0.0681  0.0244  0.0172  33.5  38.9
Airplane  Ours  0.0192  0.00462  0.00762  87.2  90.6
Car  AOCNN  0.0649  0.0264  0.0223  60.6  60.8
Car  AtlasNet  0.0393  0.0228  0.0137  75.4  81.9
Car  PSG  0.0665  0.0365  0.0247  49.8  59.4
Car  Ours  0.0280  0.00247  0.00101  87.2  88.5
Chair  AOCNN  0.0384  0.0159  0.0196  43.5  39.3
Chair  AtlasNet  0.0369  0.0137  0.0124  51.1  52.6
Chair  PSG  0.0391  0.0131  0.0152  42.9  49.1
Chair  Ours  0.0364  0.00375  0.00764  47.3  55.3
Table  AOCNN  0.0583  0.0393  0.0256  55.2  40.1
Table  AtlasNet  0.0324  0.0154  0.0146  59.1  63.7
Table  PSG  0.0354  0.0271  0.0276  41.2  42.5
Table  Ours  0.0123  0.00183  0.00127  63.3  76.8
Table 6. Quantitative comparison with G2L, GRASS and SAGNet on chair data.
Dataset  Method  JSD  MMD-CD  MMD-EMD  COV-CD  COV-EMD
Chair  G2L  0.0357  0.0034  0.0682  83.7  83.4
Chair  GRASS  0.0374  0.0030  0.0744  46.0  44.5
Chair  SAGNet  0.0342  0.0024  0.0608  75.1  74.3
Chair  Ours  0.0289  0.00274  0.00671  89.3  84.1
In Table 6 we compare our method with three recent shape generation methods, namely GRASS [li_sig17], G2L [Wang2018TOG] and SAGNet [pageSAGnet19]. For fair comparison, both our reconstructed shapes and the input shapes are voxelized. In particular, we make comparisons with GRASS on their chair data, since GRASS requires symmetry hierarchies as input for training. The results show that our method outperforms the compared methods on most metrics. We also show a visual comparison between GRASS and our method in Figure 11. The GRASS result exhibits some artifacts due to the voxel representation. Even after surface extraction, GRASS still fails to capture fine geometric details compared with our method.
Shape Generation. In Figure 12, we make a qualitative comparison between our technique and the global-to-local method [Wang2018TOG] by randomly generating shapes of airplanes. Their method uses an unconditional GAN architecture and thus cannot reconstruct a specific shape. Therefore, two randomly generated, visually similar planes are selected for comparison. Their voxel-based method fails to represent smooth, fine details of 3D shapes. We make a further comparison with the global-to-local method [Wang2018TOG] as well as 3DGAN [3dgan2016] in Figure 13. Again, we select visually similar shapes for comparison, and our method produces high-quality shapes with plausible structure and fine details, whereas the alternative methods have clear artifacts including fragmented output and rough surfaces. We also compare our technique with GRASS [li_sig17] for random shape generation. As shown in Figure 14, the structures synthesized by GRASS might be problematic, containing parts which are disjoint and/or not well supported. In addition, since it is trained on symmetry hierarchies constructed on top of the shape segmentations, once trained, for new inputs GRASS utilizes automatically generated symmetry hierarchies, which, however, can be inconsistent. This is one of the main causes for GRASS to produce results with a greater level of structural noise, including disconnections and asymmetries. In contrast, our results are physically stable and well connected. Note that our refinement step is an integral part of our pipeline and requires structure relations, so it cannot be directly applied to GRASS.
As a generative model, our technique is able to generate new shapes. Because our architecture consists of two VAEs, i.e., PartVAE and SPVAE, we can acquire different information from their latent spaces. Specifically, we extract various types of parts and structural information from the latent space of SPVAE, and combine them with the deformation information from PartVAE, to produce novel shapes. Figure 15 gives an example, where our method is used to generate computer monitors with various shapes by sampling in the learned latent space. In this example, the training data is obtained from ModelNet [Wu_2015_CVPR].
Shape Interpolation. Shape interpolation is a useful technique to generate gradually changing shape sequences between a source shape and a target shape. With the help of SPVAE, we first encode the source and target shapes into latent vectors and then perform linear interpolation in the latent space of the VAE. A sequence of shapes between the input shape pair is finally decoded from the linearly interpolated latent vectors. In Figure 16, we compare our technique with AtlasNet [AtlasNet2018] on shape interpolation. The results by AtlasNet suffer from patch artifacts, and the surfaces of the interpolated shapes are often not very smooth. The interpolation in our latent space leads to much more realistic results. For example, the armrests gradually become thinner and then disappear in a more natural manner. This is because we combine the geometry and structure during the training of SDMNET, which thus learns the implicit joint distribution of the geometry and structure.
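The interpolation step itself is a few lines of code. The sketch below produces the sequence of latent vectors between a source and a target code; encoding the two shapes and decoding each interpolated vector require the trained SPVAE and are not shown. The function name, the number of steps, and the placeholder codes are hypothetical; the 128-dimensional latent size follows the network description.

```python
import numpy as np

def interpolate_latents(z_source, z_target, num_steps):
    # Blend weights from 0 (source) to 1 (target), one row per in-between shape.
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - t) * z_source[None, :] + t * z_target[None, :]

z_source = np.zeros(128)  # placeholder for the encoded source shape
z_target = np.ones(128)   # placeholder for the encoded target shape
sequence = interpolate_latents(z_source, z_target, num_steps=5)
print(sequence.shape)  # (5, 128)
```

Each row of `sequence` would then be passed through the SPVAE decoder to obtain one shape of the interpolation sequence.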
The effectiveness of shape interpolation with SDMNET is consistently observed with additional experiments on different datasets. Figure 17 shows two examples of natural interpolation between shapes with different topologies, thanks to our flexible structure representation. Figure 18 shows an additional interpolation result with substantial change of geometry.
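Table 7. Average bidirectional Chamfer distance of part reconstruction with separate and end-to-end training.
Dataset  Car  Chair  Guitar  Airplane  Table
Separate  2.77  3.89  3.58  4.87  1.85
End-to-End  5.07  6.73  7.44  11.38  4.86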


Ablation Studies. We perform several ablation studies to demonstrate the necessity of key components of our architecture.
Support vs. adjacency relationships.
We adopt support relationships in our method, rather than adjacency relationships, to obtain well-connected shapes, because support relationships ensure that generated shapes are physically stable and provide a natural order that simplifies the structure refinement optimization. In contrast, with bidirectional adjacency, it would be much more complicated to formulate and optimize constraints between two adjacent parts. To evaluate the effectiveness of the support relationships, we replace the support constraints by simply minimizing the distance between every pair of adjacent parts to approximate the adjacency relationships. The effects of using support and adjacency constraints are shown in Figure 19. It can be seen that the support constraints lead to a physically more stable result.
Separate vs. end-to-end training.
We adopt separate training for the two-level VAE, i.e., PartVAEs are trained for individual part types first, before training SPVAE, where the geometries of parts are encoded with the trained PartVAEs. The two-level VAE could also be trained end-to-end, i.e., optimizing both PartVAEs and SPVAE simultaneously. We compare the average bidirectional Chamfer distance of the reconstruction of each part between end-to-end training and the separate training adopted in our solution, as given in Table 7. The visual comparisons are shown in Figure 20. Without the help of the well-trained latent space distribution of individual parts, end-to-end training gets stuck at a poor local minimum, leading to higher reconstruction errors and visually poor results.
Joint vs. decoupled structure and part geometry encoding.
In this paper, the geometry details represented as part deformations are encoded into the SPVAE embedding space jointly (see Section 3.4). This approach ensures that generated shapes have consistent structure and geometry, and the geometries of different parts are also coherent. We compare our solution with an alternative approach where the structure and geometry encodings are decoupled: SPVAE only encodes the structure and the geometry of each part is separately encoded using a PartVAE. Figure 21 shows randomly generated shapes with both approaches. The structure of the first example implies the shape is a sofa, but the geometry of the seat part in (a) does not look like a sofa, whereas our method generates part geometry consistent with the structure. For the second example, our method produces parts with coherent geometry (b), whereas using decoupled structure and geometry leads to inconsistent part styles (a).
Resolution of bounding boxes.
By default, our method uses bounding boxes each with triangles. We also try using lower and higher resolution bounding boxes. As shown in Figure 22, using a lower resolution (b) cannot capture the details of the shape, and using a higher resolution (d) produces a result very similar to that of our default setting (c), but takes longer. Our default setting (c) provides a good balance between efficiency and quality.
PartVAE per part type vs. single PartVAE.
In our paper, we train a PartVAE for each part type. We compare this with an alternative approach where a single PartVAE is trained for all part categories. As shown in Figure 22 (e), this approach is not effective in capturing unique geometric features of different parts, leading to poor geometric reconstruction.
Generalizability. Figure 23 shows an example that demonstrates the generalizability of our method to new shapes of the same category without input semantic segmentation. We first train PointNet++ [Qi2017nips] on our labeled dataset and use it to segment the new shape. We then obtain the reconstruction result with SDMNET. The example shows that automatically obtained semantic segmentation can be effectively used as input to our method.
Watertight models. The direct output of our method includes watertight meshes for individual parts, but not the shape as a whole. As demonstrated in Figure 24, by applying a watertight reconstruction technique [huang2018robust], watertight meshes can be obtained, which benefit certain downstream applications.
Editability. Our generative model produces Structured Deformable Meshes, which are immediately editable in a structure-aware manner. This is difficult for other generative methods (e.g., [3dgan2016; li_sig17]). An example is given in Figure 25, which shows an editing sequence, including removing parts (when a part is removed, its symmetric part is also removed), making a chair leg longer, which also affects the other chair legs due to the equal length constraint, and further deforming the chair back using an off-the-shelf deformation method [ARAP2007]. During shape deformation, the editing constraints are treated as hard constraints and the equal length constraints are used in the refinement step (see Section 3.5).
Limitations. Although our method can handle a large variety of shapes with flexible structures and fine details, it still suffers from several limitations. While our method can handle shapes with holes formed by multiple parts, if a part itself has holes in it, our deformable box is unable to represent it exactly, as the topology of parts cannot differ from that of genus-zero boxes. In this case, our method will try to preserve the mesh geometry but cannot maintain the hole. For certain parts which are unusual (e.g., the legs and back of the chair and the headstock of the guitar in Figure 26), our VAE architecture considers such cases as outliers, and “projects” them back to deformations consistent with the training set. Another limitation is that SDMNET is currently trained using a collection of shapes of the same category. It thus cannot be used for interpolating shapes of different categories.


6. Conclusions and Future Work
In this paper, we have presented SDMNET, a novel deep generative model that generates 3D shapes as Structured Deformable Meshes. A shape is represented using a set of deformable boxes, and a two-level VAE is built to encode local geometry variations of individual parts, and global structure and geometries of all parts, respectively. Our representation achieves both flexible topology and fine geometric details, outperforming the state-of-the-art methods for both shape generation and shape interpolation.
As future work, our method could be generalized to reconstruct shapes from images. Similar to [Xin2019], which uses a deep neural network to learn the segmentation masks of cylinder regions from a given image for reconstructing 3D models composed of cylindrical shapes, a possible approach to extend our method is to learn the segmentation of different parts in images and use such segmentation results as the conditions of the SPVAE for 3D shape reconstruction. In this case, SDMNET makes it possible to generate rich shapes with details to better match given images. By exploiting the latent space of our network, our approach could also be generalized to data-driven deformation by incorporating user editing constraints in the optimization framework. It would also be interesting to investigate how to extend our method to encode shapes of different categories using a single network. Our current approach generates parts with the same resolution as the primitive bounding box mesh. We currently utilize a high-resolution mesh of fixed size, and therefore our generated shapes take up a large amount of storage space because of the richer geometric details and higher mesh resolution. However, since different kinds of parts have different geometric richness, it would be better to exploit (possibly different types of) primitives with adaptive resolutions for different parts, so that we can preserve the same level of detail with significantly less storage space.
Acknowledgements.
This work was supported by National Natural Science Foundation of China (No. 61828204 and No. 61872440), Beijing Municipal Natural Science Foundation (No. L182016), Youth Innovation Promotion Association CAS, CCF-Tencent Open Fund, and SenseTime Research Fund. Hongbo Fu was partially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 11237116, CityU 11300615), and the Centre for Applied Computing and Interactive Media (ACIM) of School of Creative Media, CityU.
Appendix: support relationship formulation.
Let and be the bounding boxes of parts and projected onto the plane orthogonal to the supporting direction. They should satisfy either or . This constraint can be formulated as an integer programming problem and solved efficiently during the optimization as follows:
Let and be the two directions in the tangential plane. Denote by and two auxiliary binary variables, , this is equivalent to
(5)  
(6)  
(7) 
where is a large positive number (larger than any possible coordinate in the shape), Eq. (7) is true if at most one of or can be 1, i.e., at least one of them is 0. Without loss of generality, assuming , then the set of equations in (5) without the term involving is true, meaning . Similarly, when , it satisfies that .
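The either-or construction with two binary variables and a large constant can be checked numerically. The sketch below is one possible reading, not the paper's exact formulation: it assumes the two alternatives are containment of one projected box in the other, represents each projected box as (min_u, max_u, min_v, max_v) along the two tangential directions, and uses hypothetical names z_i, z_j for the binary variables and big_m for the large constant. Setting a binary variable to 1 relaxes the corresponding containment inequalities by big_m, and the side constraint z_i + z_j <= 1 forces at least one containment to hold.

```python
from itertools import product

def contains(outer, inner):
    # Plain containment test for boxes given as (min_u, max_u, min_v, max_v).
    return (outer[0] <= inner[0] and inner[1] <= outer[1] and
            outer[2] <= inner[2] and inner[3] <= outer[3])

def relaxed_inside(inner, outer, z, big_m):
    # Containment inequalities, each loosened by big_m when z == 1.
    slack = big_m * z
    return (outer[0] - slack <= inner[0] and inner[1] <= outer[1] + slack and
            outer[2] - slack <= inner[2] and inner[3] <= outer[3] + slack)

def big_m_feasible(box_i, box_j, big_m=1e6):
    # Enumerate the binary variables; at most one of them may be 1.
    for z_i, z_j in product((0, 1), repeat=2):
        if z_i + z_j > 1:
            continue
        if (relaxed_inside(box_i, box_j, z_i, big_m) and
                relaxed_inside(box_j, box_i, z_j, big_m)):
            return True
    return False

small, large, shifted = (0, 1, 0, 1), (-1, 2, -1, 2), (0, 3, 0, 1)
print(big_m_feasible(small, large))    # True: small lies inside large
print(big_m_feasible(shifted, large))  # False: neither box contains the other
```

For boxes whose coordinates are much smaller than big_m, the enumeration is feasible exactly when one box contains the other, which is the disjunction the binary variables are meant to encode.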