Multi-chart Generative Surface Modeling

Heli Ben Hamu    Haggai Maron    Itay Kezurer    Gal Avineri    Yaron Lipman
Figure 1: Our method is able to learn shape distribution and generate unseen shapes. This figure shows 1024 human models randomly generated by our method.

Abstract

This paper introduces a 3D shape generative model based on deep neural networks. A new image-like (i.e., tensor) data representation for genus-zero 3D shapes is devised. It is based on the observation that complicated shapes can be well represented by multiple parameterizations (charts), each focusing on a different part of the shape. The new tensor data representation is used as input to Generative Adversarial Networks for the task of 3D shape generation.

The 3D shape tensor representation is based on a multi-chart structure that enjoys a shape covering property and scale-translation rigidity. Scale-translation rigidity facilitates high quality 3D shape learning and guarantees unique reconstruction. The multi-chart structure uses as input a dataset of 3D shapes (with arbitrary connectivity) and a sparse correspondence between them. The output of our algorithm is a generative model that learns the shape distribution and is able to generate novel shapes, interpolate shapes, and explore the generated shape space. The effectiveness of the method is demonstrated for the task of anatomic shape generation including human body and bone (teeth) shape generation.

1 Introduction

Generative models of 3D shapes facilitate a wide range of applications in computer graphics such as automatic content creation, shape space analysis, shape reconstruction and modeling.

The goal of this paper is to devise a new (deep) 3D generative model for genus-zero surfaces based on Generative Adversarial Networks (GANs) [15]. The main challenge in 3D GANs compared to image GANs is finding a representation of 3D shapes that enables efficient learning. Since standard CNNs work well with image-like data (i.e., tensors), whereas defining CNNs on unstructured data remains a challenge [8], most 3D GAN methods concentrate on representing the input shapes in tensor form, for example using a volumetric grid [36, 31] or depth maps [30]. Although natural, these representations suffer from either the high dimensionality of volumetric tensors, their crude brick-like approximation properties, or the partial, discontinuous and/or occluded cover achieved with projection-based methods. In a recent paper, Groueix et al. [17] represent 3D shapes using multiple charts, where each individual chart is defined as a multilayer perceptron (MLP).

The approach taken in this paper toward 3D shape representation also uses multiple charts, but in contrast to previous work the different charts are represented as a single tensor (i.e., a regular grid of numbers) with the following properties: (i) the different charts are related by a so-called multi-chart structure describing their interrelations; (ii) the charts participating in the tensor are smooth (in fact, angle-preserving), bijective, and consistent; and (iii) standard tensor convolution used in off-the-shelf CNNs can be applied to this tensor representation and is equivalent to a well-defined convolution on (a cover of) the original surface.

Figure 2: Automatic random generation of 25 teeth models.

The multi-chart structure is the main building block of our 3D shape tensor representation. Intuitively, a multi-chart structure is a collection of conformal charts, each defined using a triplet of points that, together, cover with small distortion all parts of the shape and are scale-translation rigid. Scale-translation (s-t) rigidity is a property that allows recovering the mean and scale of all the charts in a unique manner. S-t rigidity turns out to be significant as the training process has to be performed on normalized charts for effective 3D shape learning. We study s-t rigidity of the multi-chart structure showing it is a generic property and providing a simple sufficient condition for it. The multi-chart structure requires only a sparse set of correspondences as input and allows processing shapes with different connectivity and unknown dense correspondence using standard image GAN frameworks.

We tested our multi-chart 3D GAN method on two classes of shapes: human body and anatomical bone surfaces (teeth). For human body shapes we used datasets of human bodies [6, 37] consisting of different humans in a collection of poses as input to our 3D GAN framework. Figure 1 shows renderings of 1024 models randomly generated using the trained multi-chart 3D GAN. Note the diversity of the human body shapes and poses created by the generative model. For bone surfaces we used the teeth dataset of [7]; Figure 2 depicts teeth randomly generated using our method. As we demonstrate in this paper, our method compares favourably to different baselines and previous methods for 3D shape generation.

2 Previous Work

Generative adversarial networks.

Generative adversarial networks (GANs) are deep neural networks aimed at generating data from a given distribution [16]. GANs are composed of two sub-networks (often convolutional neural networks): a generator, which is in charge of generating an instance from the distribution of interest, and a discriminator, which tries to discriminate between instances that were sampled from the original distribution and instances that were generated by the generator. The training process alternates between optimizing the discriminator to recognize the true samples and optimizing the generator to fool the discriminator. These models have become very popular in the last few years and have been used to generate many data types such as images [16], videos [32], 3D data (as reviewed below) and more. Our work uses GANs in order to generate surfaces of a certain class. We dedicate the rest of this section to the generation of 3D data. For further details on GANs see [15].

Volumetric data generation.

A natural way to use deep learning for 3D data generation is to work on volumetric grids and corresponding volumetric convolutions [36]. Usually, the shape is represented using an occupancy function on the grid. Most approaches use autoencoders or GANs as the generative model.

Multiple works take different inputs such as a 3D scan with missing parts or images: [10] use convolutional neural networks in order to fill in missing parts in scanned 3D models, a task that was also recently targeted by [33] (using GANs and recurrent convolutional networks). [14] propose to learn a distribution of 3D shapes from an input of images of these models using a novel 2D projection layer. In a related work, [39] try to generate a 3D model from a single image. Another type of input can be supplied by the user: [22] suggested a system based on 3D GANs and user interaction that generates 3D models. A main drawback of these volumetric approaches is the high computational load of working in a discretized 3D space, which results in low-resolution shape representations. [31] tried to bypass this problem by using smart data structures (e.g., octrees) for 3D data. Another disadvantage is the fact that volumetric indicator functions are not well suited for smooth surface approximation, resulting in brick-like reconstructions.

Point cloud data generation.

Some works have targeted the generation of 3D point clouds. This type of 3D data representation resolves the resolution limitation of the volumetric representation, but introduces new challenges such as invariance and equivariance to point order [27, 38]. [12] use this representation for the problem of 3D reconstruction from a single image. [26] use a variational autoencoder [11] in order to generate point clouds and corresponding normals.

Depth maps generation.

Another group of papers have targeted the generation of depth maps (possibly with normals). A depth map is a convenient representation since it is formulated as a tensor (a regular grid of numbers) similarly to standard images.

[30] use an encoder-decoder architecture in order to generate a depth map from a single image. [35] suggest an end-to-end framework that takes images and generates voxelized 3D models of the shapes in them, estimating depth maps, silhouettes and normal maps as an intermediate step in the network. [24] use drawings as input and generate multi-view depth maps and normals which are again fused together into a single output. In a different variation, [29] learn a model that takes a depth map or a silhouette of a shape and generates multi-view depth maps and corresponding silhouettes. Using these depth outputs they generate a single point cloud in a post-process.

Surface generation.

The last shape representation we discuss is a 3D triangular mesh. This representation includes both a point cloud and connectivity information and is the type of representation we use in this paper.

[21] targeted deformable shape completion using a variational autoencoder, but their framework also allows sampling random human shapes, which is the main focus of our paper. Their main limitation, in comparison to our method, is their reliance on consistent input connectivity (i.e., input shapes with the same triangulation and full 1-1 correspondences), while we only rely on a sparse set of consistent landmarks. This allows us to learn from multiple different datasets consisting of diverse shapes with arbitrary triangulations.

[28] use a parameterization to a regular planar domain (an image) and represent the surface using its Euclidean coordinates. [17] use multilayer perceptrons (MLPs) in order to learn multiple parameterization functions directly (i.e., functions $\mathbb{R}^2 \to \mathbb{R}^3$). These works are the most similar to ours: Similarly to [28] we also use parameterizations into a planar domain; we use parameterizations of a cover of the surface to a torus as in [25]. In contrast to their work, our parameterizations are conformal and we use multiple charts that cover the shape and preserve small details. We note that [28] solve for dense correspondences of the input shapes as a preprocess, which is a challenging problem that currently cannot always be accomplished with high accuracy. [17], on the other hand, also use multiple parameterizations. Their method is more general than ours as they do not assume sparse correspondences between the shapes, nor that the input shapes are of sphere topology. The downside of their approach is that their generated shapes have considerably fewer details and the generated charts do not match with high accuracy.

Pre-deep learning works.

Multiple works have targeted shape synthesis in the pre-deep learning era. Some works concentrated on composing new shapes from components; [13] suggested an interactive system where a user can assemble shapes from a segmented shape database. [19] learn a generative component based model that is able to generate novel shapes from a certain class.

Another line of works tried to learn the shape space of a certain class of shapes; [2, 3, 23] have all targeted the shape of the human body. In contrast to our work, these works solve for dense correspondences using a specifically-tailored deformation model. We do not use a specific deformation model. Instead, we use a high dimensional generative model to learn the shape space. We further demonstrate that other classes of shapes (e.g., bones) can be learned by the exact same generative model.

3 Method

3.1 Problem statement

Given a collection of surfaces $\mathcal{S} = \{\mathcal{S}^a\}_{a=1}^{N}$ sampled from some distribution $\mathcal{D}$ of surfaces in $\mathbb{R}^3$ of the same class (e.g., humans, bones), our goal is to learn a generative model of $\mathcal{D}$. By generative model we mean a random variable that samples from the distribution $\mathcal{D}$.

The surfaces in $\mathcal{S}$ are represented as surface meshes, namely triplets of the form $(V^a, E^a, F^a)$, where $V^a$, $E^a$, $F^a$ are the vertex set, the edge set, and the face set, respectively. We do not require the meshes to share connectivity, nor that a complete correspondence between the meshes is known. Rather, we will assume only a sparse set of landmark correspondences is given, $q_\ell^a \in V^a$, $\ell = 1, \ldots, k$. In this paper we used $k = 6$ (for bones) or $k = 21$ (for humans); see Figure 3(b) for a visualization of the landmarks (orange dots) on three human surfaces in $\mathcal{S}$. For brevity, we will henceforth remove the superscript $a$ from $V$, $E$, $F$ and $q_\ell$.

3.2 Conformal toric charts

Our approach for learning is to reduce the surface generation problem to the image generation problem and use state-of-the-art image GANs. The reduction to the image setting is based on a generalization of [25] to the multi-chart setting. [25] computes charts from the image domain to a surface $\mathcal{S}$ using conformal charts $\Phi : \mathcal{T} \to \mathcal{S}$, where $P = \{p_1, p_2, p_3\} \subset \mathcal{S}$ is a triplet of points, the chart factors through a topological torus constructed by stitching four identical copies of $\mathcal{S}$ cut along $P$, and $\mathcal{T}$ is the flat torus, namely the square $[0,1]^2$ where opposite edges of the square are identified (i.e., a periodic square). The torus is used as it is the only topological surface where the standard image convolution on $[0,1]^2$, equipped with periodic padding, corresponds to a continuous, translation-invariant operator over the surface. The degrees of freedom in the conformal chart are exactly the choice of the triplet $P$, to which the center and corners of $\mathcal{T}$ are mapped; see the inset for an illustration.

A conformal chart, while preserving angles, can produce significant area scaling, and different choices of triplets produce low scale distortion in different areas of the surface. In fact, for surfaces with protruding parts it is impossible to choose a single triplet (chart) that provides low scale distortion everywhere. In this paper we therefore advocate a multi-chart structure that produces a global, low scale-distortion coverage of the surface.

Figure 3: (a) The multi-chart structure; (b) three meshes from the collection, where for each we show the landmark correspondences (left in each pair) and the maximal scale across all charts (right in each pair); (c) three charts corresponding to three different faces of the multi-chart triangulation, where we show the chart's triplet of points on the surface (left), the coordinate functions flattened to the torus (bottom) and the geometry reconstructed from this chart (right). Different charts provide low distortion coverage of different areas of the surface.

3.3 Multi-chart structure

The multi-chart structure is a collection of charts that collectively represents a shape. Each chart

$$\Phi_f : \mathcal{T} \to \mathcal{S} \qquad (1)$$

is defined by a triplet of landmark points $(q_i, q_j, q_k)$ chosen from the collection of landmark points on the surface. The multi-chart structure is therefore a pair $(Q, T)$, where $Q = \{q_\ell\}$ is the set of landmarks and $T = (V, E, F)$ is an abstract triangulation, where $V$ is the vertex set, $E$ the edge set, and $F$ the face set. Every face $f = (i, j, k) \in F$ of the multi-chart triangulation represents a chart $\Phi_f$, as in (1). We will abuse notation and write $\Phi_f = \Phi_{(i,j,k)}$. See Figure 3(a) for a visualization of the multi-chart triangulation embedded in $\mathbb{R}^3$, and 3(b) for a visualization of the landmarks on three human surfaces.

Every mesh in our collection has $|F|$ charts (in this paper we choose $|F| = 4$ (for bones) or $|F| = 16$ (for humans)). Figure 3(c) shows three charts of a single mesh; for each chart we show: its defining triplet of landmarks, set by a face in the triangulation (orange dots); the chart itself, $\Phi_f$, visualized as an RGB image over $\mathcal{T}$; and the geometry captured by $\Phi_f$, visualized by restricting it to a finite mesh overlaid on $\mathcal{T}$.

In order to faithfully represent shapes and enable effective 3D shape learning, the multi-chart structure should possess the following two properties: Covering property and Scale-translation (s-t) rigidity.

Covering property

Each face (triplet) in the multi-chart structure zooms in on a different part of the surface. As the meshes are assumed to be of the same class (e.g., humans), it is usually possible to choose a multi-chart structure such that the chart collection produces a good coverage of all meshes in the collection.

Figure 3(b) illustrates three different meshes colored according to the maximal area scale exerted by the different charts at each point on the surface. Note that almost everywhere this maximal scale is bounded from below, meaning that every part of the original surface is represented in at least one of the charts with a bounded scale factor.

Scale-translation (s-t) rigidity property.

As we demonstrate in Section 5, when training a network to predict multi-charts it is imperative that all the charts are centered (zero mean) and of the same scale (unit norm); this ensures the network does not concentrate on learning large-norm charts (e.g., torso) while neglecting small-norm charts (e.g., head or hands).

Centering and scaling of the charts results in the loss of their natural scale and mean value. Thus, to reconstruct a shape from centered-scaled multi-charts (which are the output of our network) we need to recover, for each chart, a unique scale and mean (also referred to as translation). Each centered-scaled chart contains a triplet of points in $\mathbb{R}^3$,

$$\tilde{q}_i^f, \ \tilde{q}_j^f, \ \tilde{q}_k^f, \qquad f = (i, j, k) \in F, \qquad (2)$$

that are a centered-scaled version of the original landmarks $q_i, q_j, q_k$. S-t rigidity is the property that allows reconstructing the original scale and translation (i.e., mean) of the charts:

Definition 1.

A multi-chart structure $(Q, T)$ is scale-translation (s-t) rigid if, given a set of centered-scaled triplets, Eq. (2), the original landmarks $Q$ can be recovered uniquely up to a global scale and translation.

Let us provide an algebraic characterization of s-t rigidity. Let $\tilde{q}_i^f \in \mathbb{R}^3$ be the positions of every vertex $i$ in every face (i.e., chart) $f \in F$ it belongs to. Points $\tilde{q}_i^f$ corresponding to the same vertex $i$ of the triangulation might not be equal (recall that each face is centered and scaled independently). We would like to find a translation $t_f \in \mathbb{R}^3$ and scale $s_f \in \mathbb{R}$ per face to reverse the centering and scaling and obtain a consistent embedding $q_i \in \mathbb{R}^3$ of the vertices, unique up to global scale and translation. Consistent means each vertex has the same coordinates in each triangle it belongs to. The unknowns $s_f, t_f, q_i$ are solutions to the linear system:

$$s_f \tilde{q}_i^f + t_f = q_i, \qquad \forall f \in F, \ \forall i \in f. \qquad (3)$$

This is a homogeneous over-determined system of equations, where for each solution $(s_f, t_f, q_i)$ also its global scales $(\alpha s_f, \alpha t_f, \alpha q_i)$, $\alpha \in \mathbb{R}$, and/or global translations $(s_f, t_f + \tau, q_i + \tau)$, $\tau \in \mathbb{R}^3$, are solutions. To set a unique solution we need to fix the scale and translation of a single chart, $f_1$:

$$s_{f_1} = 1, \qquad t_{f_1} = 0. \qquad (4)$$

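For concreteness, the following sketch assembles the linear system (3)-(4) and solves it in the least-squares sense with NumPy. The function name and array layout are our own assumptions, not code from the paper.

```python
# Sketch: recover per-chart scales s_f and translations t_f, and consistent
# landmark positions q_i, from the normalized (centered-scaled) triplets by
# solving the linear system (3)-(4) in the least-squares sense.
import numpy as np

def recover_scales_translations(tris, q_tilde, n_landmarks):
    """tris: (n_faces, 3) landmark indices per face.
    q_tilde: (n_faces, 3, 3) normalized triplet positions per face.
    Unknowns: per face a scale s_f and translation t_f, per landmark a q_i."""
    n_faces = len(tris)
    n_unknowns = 4 * n_faces + 3 * n_landmarks
    rows, rhs = [], []
    for f, face in enumerate(tris):
        for v, i in enumerate(face):
            for d in range(3):  # Eq. (3): s_f * q~^f_i + t_f - q_i = 0
                row = np.zeros(n_unknowns)
                row[4 * f] = q_tilde[f, v, d]         # coefficient of s_f
                row[4 * f + 1 + d] = 1.0              # coefficient of t_f[d]
                row[4 * n_faces + 3 * i + d] = -1.0   # coefficient of q_i[d]
                rows.append(row)
                rhs.append(0.0)
    for d, val in enumerate([1.0, 0.0, 0.0, 0.0]):  # Eq. (4): s_{f1}=1, t_{f1}=0
        row = np.zeros(n_unknowns)
        row[d] = 1.0
        rows.append(row)
        rhs.append(val)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    s = sol[:4 * n_faces:4]
    t = sol[:4 * n_faces].reshape(n_faces, 4)[:, 1:]
    q = sol[4 * n_faces:].reshape(n_landmarks, 3)
    return s, t, q
```
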
S-t rigidity can be equivalently stated in terms of the linear system (3)-(4):

Proposition 1.

A multi-chart structure is scale-translation rigid iff the matrix of the linear system (3)-(4) has full column rank.

We prove this proposition in Appendix A. Next, we claim that s-t rigidity is a property depending only on the abstract triangulation $T$ and not on a specific choice of landmarks $Q$.

Figure 4: Equispaced interpolation between two humans of different body characteristics.
Theorem 1.

A multi-chart structure with triangulation $T$ is either scale-translation rigid for almost all landmark sets $Q$ or not scale-translation rigid for any $Q$.

This theorem can be proved using the fact that a non-zero multivariate polynomial is non-zero almost everywhere [9]. The full proof can be found in Appendix A.

It is so far not clear that s-t rigid triangulations even exist. Furthermore, Proposition 1 does not provide a practical way of designing multi-chart triangulations that are s-t rigid. The following theorem provides a simple sufficient condition for s-t rigidity. The condition is formulated solely in terms of the connectivity of $T$, and applies to all generic $Q$, that is, where no 4 landmarks are co-planar.

Theorem 2.

A 2-connected triangulation with chordless cycles of length at most 4 is scale-translation rigid.

Chordless cycles are cycles in the graph that cannot be shortened by an existing edge between non-consecutive vertices in the cycle. The theorem is proved in Appendix A. The inset shows three triangulations (from left to right): an s-t non-rigid one, due to failure of 2-connectedness (at the yellow vertex, for example); an s-t rigid one; and an s-t non-rigid one with a chordless cycle of length 5. The sufficient condition can be checked directly on the triangulation graph, as in the sketch below.
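
A minimal sketch, assuming NetworkX >= 3.1 (for chordless_cycles); the function name and face encoding are our assumptions:

```python
# Sketch: test the sufficient condition of Theorem 2 (2-connectedness and
# chordless cycles of length at most 4) on a multi-chart triangulation.
import networkx as nx

def satisfies_theorem_2(faces):
    """faces: list of landmark-index triplets, e.g. [(0, 1, 2), (1, 2, 3)]."""
    G = nx.Graph()
    for i, j, k in faces:
        G.add_edges_from([(i, j), (j, k), (k, i)])
    if not nx.is_biconnected(G):  # 2-connectedness
        return False
    # every chordless cycle must have length <= 4
    return all(len(c) <= 4 for c in nx.chordless_cycles(G))

# Two triangles glued along an edge satisfy the condition (hence, for
# generic landmarks, form an s-t rigid multi-chart structure):
assert satisfies_theorem_2([(0, 1, 2), (1, 2, 3)])
```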

Several comments are in order: First, as shown in the inset there are graphs with a chordless cycle of length 5 that are not s-t rigid, and therefore the above theorem cannot be strengthened by simply replacing 4 with 5; second, the theorem can be strengthened by considering only cycles between s-t rigid components; third, the genericity condition can be weakened by enforcing it only on chordless cycles. Lastly, the notion of s-t rigidity is related to the notion of parallel rigidity. A graph embedding is parallel rigid if it does not have a non-trivial parallel redrawing, where a parallel redrawing is a different embedding of the same graph with edges parallel to the original edges. Necessary and sufficient conditions for parallel rigidity exist (see, e.g., Theorem 8.2.2 in [34]), however these conditions are harder to work with in comparison to Theorem 2.

In this paper we use multi-chart structures with triangulations that satisfy the sufficient condition for s-t rigidity described in Theorem 2. Figures 3(a)-(b) show this multi-chart structure in the case of human body shape.

3.4 Mesh to tensor data

The multi-chart structure is used to transfer the input mesh collection into a collection of standard image tensor data as follows.

We consider the coordinate functions over the meshes, $x = (x^1, x^2, x^3) : \mathcal{S} \to \mathbb{R}^3$, and use our multi-chart structure to transfer these coordinate functions to images. Given a chart $\Phi_f$, we pull the coordinate functions back to the flat torus via

$$x \circ \Phi_f : \mathcal{T} \to \mathbb{R}^3 \qquad (5)$$

and sample it on a regular grid of size $m \times m$, where in this paper we use $m = 64$. This leads to tensor input data $Y_f \in \mathbb{R}^{m \times m \times 3}$ per chart. Figure 3(c) shows three such tensors as colored square images for three different charts. Concatenating all charts per mesh gives the final multi-chart tensor representation of a mesh,

$$Y = \left( Y_{f_1}, Y_{f_2}, \ldots, Y_{f_{|F|}} \right) \in \mathbb{R}^{m \times m \times 3|F|}. \qquad (6)$$

$Y^a$ contains all geometric data of mesh $\mathcal{S}^a$, and the entire input tensor data is $\{Y^a\}_{a=1}^N$. Differently from images, which contain 3 (RGB) channels, every instance of our data contains $3|F|$ channels in $|F|$ groups. Each group $Y_f$ contains the three coordinate functions of the surface transferred to $\mathcal{T}$ using a different chart. As discussed above, since different charts have different means and scales/variances (e.g., torso and head) it is important that the different channels in $Y^a$ are normalized, i.e., each $Y_f$ is centered and scaled to be of unit norm (variance). Otherwise the learning process is suboptimal for the small, non-centered charts, e.g., those corresponding to the head and hands. Therefore, in our data we normalize all charts. Of course, we can only do that if there is a unique way to recover scale and translation per chart, which is the case if the multi-chart triangulation is scale-translation rigid.
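
As a minimal sketch (with assumed names and shapes), the per-chart normalization applied to each training tensor looks as follows:

```python
# Sketch: center each chart (a group of 3 channels) to zero mean and scale
# it to unit norm, as required before feeding the tensors to the GAN.
import numpy as np

def normalize_charts(Y, n_charts):
    """Y: (m, m, 3 * n_charts) multi-chart tensor of a single mesh."""
    Y = Y.copy()
    for f in range(n_charts):
        chart = Y[:, :, 3 * f:3 * f + 3]  # view into chart f
        chart -= chart.mean(axis=(0, 1), keepdims=True)  # zero mean
        chart /= np.linalg.norm(chart)                   # unit norm
    return Y
```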

The charts $\Phi_f$ map $\mathcal{T}$ onto four copies of the surface $\mathcal{S}$. Accordingly, the tensor $Y$ also contains four copies of the surface's coordinate data. We denote by

$$Y[f] \in \mathbb{R}^{3 \times 3} \qquad (7)$$

the entries of $Y$ corresponding to (one copy of) the triplet of landmarks of chart $f$.

Figure 5: The generator and discriminator architecture.

3.5 Architecture and layers

We apply the image GAN technique [16] to learn our surface generator $G$ from the input set of surfaces represented as tensors of the same dimensions, $\{Y^a\}$. The loss function used to train the generator is defined using a discriminator $D$, which is also a deep network, aiming to classify input multi-chart data as either a real surface or a generated surface. The discriminator is fed with both real instances and generated instances and optimizes a loss trying to correctly discriminate between the two.

In this paper we use an architecture similar to [20] without the progressive growing part; that is, we do not change the resolution during learning. The loss we use is the Wasserstein loss with gradient penalty [18]. We apply the following changes to the network to adapt it to our geometric setting. The architectures of the generator and discriminator are shown in Figure 5 and more details can be found in Appendix B.

Number of channels.

First, we change the number of output channels generated by $G$ to $3|F|$ (48 for humans, 12 for teeth) and rescale the number of channels accordingly in both G and D; see Appendix B for all channel sizes.

Periodic convolutions and deconvolutions, symmetric projection.

Second, similarly to [25], all convolutions are applied with periodic padding to account for the original surface's topology. Deconvolutions are implemented by bilinear upsampling followed by a periodic convolution (as in [20]). Furthermore, since we are working with four copies of the surface we also incorporate the (max) symmetry projection layer after the convolution layers of the generator, which makes sure all four copies are identical (again, as in [25]); see Figure 5.
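
To illustrate, here is a sketch of a periodic ("wrap") convolution in TensorFlow 2; the layer name is our assumption, and this is not the paper's original code, which used an earlier TensorFlow version.

```python
# Sketch: convolution with periodic padding, matching the flat-torus
# topology: opposite edges of the grid are copied before a VALID conv.
import tensorflow as tf

def periodic_pad(x, p):
    """x: (batch, H, W, C); wrap-pad by p pixels on every side."""
    if p == 0:
        return x
    x = tf.concat([x[:, -p:], x, x[:, :p]], axis=1)        # top/bottom
    x = tf.concat([x[:, :, -p:], x, x[:, :, :p]], axis=2)  # left/right
    return x

class PeriodicConv2D(tf.keras.layers.Layer):
    def __init__(self, filters, kernel_size=3, **kwargs):
        super().__init__(**kwargs)
        self.pad = (kernel_size - 1) // 2
        self.conv = tf.keras.layers.Conv2D(filters, kernel_size,
                                           padding="valid")

    def call(self, x):
        return self.conv(periodic_pad(x, self.pad))
```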

Landmark consistency.

Third, our data is per-chart normalized and hence the generator will also learn (approximately) normalized charts $\hat{Y}_f$, $f \in F$. One property that always holds for the data is that there exists a unique scale and translation per chart that solves (3)-(4) exactly. We will therefore enforce this exactness condition on the generator output $\hat{Y} = G(z)$.

We implement a layer, called landmark consistency, that given a generated tensor $\hat{Y}$ extracts the landmark values $\hat{Y}[f]$ (as in (7)), and solves the linear system (3)-(4), with $\hat{Y}[f]$ in the role of the centered-scaled triplets, in the least-squares sense. Note that Theorem 1 implies the existence of a unique solution in this case, almost always. Then, we transform each triplet $\hat{Y}[f]$ by the respective scale and translation, obtaining new locations for the landmarks; replace each landmark value with the average of all values corresponding to the same landmark; and transform back by subtracting the translation and dividing by the scale. Lastly we replace the values in $\hat{Y}$ with the new, consistent values.
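
A NumPy sketch of this projection, reusing recover_scales_translations from the earlier snippet (names and shapes are our assumptions):

```python
# Sketch: landmark-consistency projection. landmarks: (n_faces, 3, 3) holds
# the generated positions of each chart's landmark triplet. The least-squares
# positions q_i act as the per-landmark averages described in the text.
import numpy as np

def landmark_consistency(landmarks, tris, n_landmarks):
    s, t, q = recover_scales_translations(tris, landmarks, n_landmarks)
    out = np.empty_like(landmarks)
    for f, face in enumerate(tris):
        for v, i in enumerate(face):
            out[f, v] = (q[i] - t[f]) / s[f]  # back to chart f's frame
    return out  # consistent values, written back into the generated tensor
```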

Zero mean.

Lastly, a zero-mean layer is implemented, removing the mean of every chart in the generated tensor $\hat{Y}$. As mentioned above, this condition is also satisfied by our training data $Y^a$.

Figure 6: Reconstruction of a full shape from multiple charts. (a) several generated charts. (b) All charts after solving for scales and translations. (c) Reconstructed mesh with color coding of the charts with maximal scale used for each point.

3.6 Reconstruction

The last part of our pipeline is reconstructing a surface from the generative model output $\hat{Y} = G(z)$. The reconstruction includes two steps: (i) recover a scale and translation per chart in $\hat{Y}$; and (ii) extract the vertex coordinates of a template mesh from $\hat{Y}$.

Recover scale and translations.

The first step in reconstructing a surface out of the generator output is to recover a scale and translation per chart. This is done by solving the linear system (3)-(4), where the centered-scaled triplets are the landmark values from the different charts of $\hat{Y}$. Since our network includes a landmark consistency projection layer (see Subsection 3.5), there exists an exact solution to this system, and the solution is unique due to the scale-translation rigidity of the multi-chart structure. Let $\bar{Y}$ denote the charts of $\hat{Y}$ after applying the recovered scales and translations. Figure 6(a) shows examples of the different charts in $\hat{Y}$; and (b) shows the different charts of $\bar{Y}$ embedded in $\mathbb{R}^3$ after solving for and rectifying the scales and translations.

Template fitting.

In the second stage of the reconstruction process we use as template mesh a per-vertex average of the rest-pose models in DFAUST [6]. We reconstruct the final mesh using the data in $\bar{Y}$: we use the connectivity of the template (i.e., its edge and face sets) and set the vertex locations using the multi-chart structure as follows,

$$v = \frac{\sum_{f \in F} \tau_f(v)\, \bar{Y}_f\big[\Phi_f^{-1}(v)\big]}{\sum_{f \in F} \tau_f(v)}, \qquad (8)$$

where $\tau_f(v)$ is the inverse area scale of the 1-ring of vertex $v$ exerted by chart $f$ of the template mesh, and $\bar{Y}_f[\Phi_f^{-1}(v)]$ is the image of the point $\Phi_f^{-1}(v)$ under the learned chart, computed by bilinear interpolation of $\bar{Y}_f$ in each of its grid cells. Equation (8) makes sense since each point's coordinates are mainly influenced by the charts that represent it well. Figure 6(c) shows the final reconstruction with color coding of the chart with maximal scale used for each point; note the similarity of (b) and (c).
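
The following sketch outlines the readout of Equation (8); the precomputed arrays uv (template vertices mapped through each inverse chart) and tau (the weights), as well as the function names, are our assumptions.

```python
# Sketch: template fitting, Equation (8). Each template vertex is a
# tau-weighted average of bilinear samples from the rectified charts.
import numpy as np

def bilinear_sample(img, uv):
    """Sample img (m, m, 3) at uv in [0, 1)^2 with wrap-around (torus)."""
    m = img.shape[0]
    x, y = uv[:, 0] * m, uv[:, 1] * m
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    fx, fy = (x - x0)[:, None], (y - y0)[:, None]
    x0, y0 = x0 % m, y0 % m
    x1, y1 = (x0 + 1) % m, (y0 + 1) % m
    return (img[y0, x0] * (1 - fx) * (1 - fy) + img[y0, x1] * fx * (1 - fy) +
            img[y1, x0] * (1 - fx) * fy + img[y1, x1] * fx * fy)

def reconstruct_vertices(charts, uv, tau):
    """charts: (n_faces, m, m, 3) rectified charts; uv: (n_verts, n_faces, 2);
    tau: (n_verts, n_faces) inverse area-scale weights of the template."""
    V = np.zeros((tau.shape[0], 3))
    for f in range(charts.shape[0]):
        V += tau[:, f:f + 1] * bilinear_sample(charts[f], uv[:, f, :])
    return V / tau.sum(axis=1, keepdims=True)
```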

4 Implementation Details

4.1 Datasets

Humans

The training set we have used for human body generation consists of two large datasets of human models: DFAUST [6] and CAESAR [37]. The DFAUST dataset contains models of ten different people in multiple body poses. The CAESAR dataset complements DFAUST and contains models of many different people in rest pose. Both datasets are aligned internally. We align the two datasets to each other by removing the mean from each shape, scaling it to a fixed surface area, and solving for the optimal rotation fitting a set of landmarks between the datasets using singular value decomposition (SVD), as sketched below. As each of these datasets has consistent vertex numbering, we manually select the set of landmarks on a single model from each dataset. For this shape class we used a multi-chart structure that consists of 16 triangles and 21 landmarks, demonstrated in Figure 3. In order to make our training set balanced we chose 8244 models from DFAUST (by taking every fifth shape) and doubled the number of CAESAR models to 5750.
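
A sketch of the SVD-based rotation fitting (the Kabsch algorithm) for this alignment, assuming the landmark sets are already centered; the function name is ours:

```python
# Sketch: optimal rotation aligning two corresponding landmark sets.
import numpy as np

def optimal_rotation(P, Q):
    """P, Q: (k, 3) centered landmark sets; returns the rotation R
    minimizing sum_i || R @ P[i] - Q[i] ||^2."""
    U, _, Vt = np.linalg.svd(Q.T @ P)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    return U @ np.diag([1.0, 1.0, d]) @ Vt
```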

Bones

We also evaluated our method on anatomical surfaces [7]. We used 70 models and, as in [25], we converted the meshes to sphere-type topology. We also extrinsically aligned the teeth using their landmarks. For this shape class we used a multi-chart structure that consists of 4 triangles and 6 landmarks.

Figure 7: A naive single-chart surface generation approach with [25]. In each pair: left, generated charts; right, reconstructed surface. Although individual charts are generated with high quality, the different charts are learned independently and consequently do not fit together.

4.2 Training details

We implemented the networks using the TensorFlow library [1] in Python. For the larger network that generates human models we perform synchronous training on 2 NVIDIA P100 GPUs, and for the smaller network that generates teeth we use a single P100 GPU. During training, we alternate between processing a batch for the generator and processing a batch for the discriminator. One epoch takes sec and sec for the humans and teeth networks, respectively. The networks converge after 800 and 80k epochs for the humans and teeth, respectively. Generating a new surface takes sec, of which the feed-forward takes sec on a single P100 GPU and the reconstruction takes sec (CPU).

Due to noise in the generated landmark values during the early stages of learning, we start the training with no landmark consistency layer. After 50 or 10k epochs, for humans and teeth respectively, we add the landmark consistency layer. To avoid bias in scale and translation we randomize the fixed chart $f_1$ at each iteration. Furthermore, to overcome numerical instabilities we add a regularization term to the least-squares system (3)-(4) of the form

$$\lambda \sum_{f \in F} \left( s_f - \bar{s}_f \right)^2, \qquad (9)$$

where $\lambda$ is a parameter and $\bar{s}_f$ is the average scale of chart $f$ as computed in a preprocess across the entire data. We kept $\lambda$ fixed for the next 450 epochs and then reduced it by a constant multiplicative factor every epoch, until a total of 800 epochs was reached. For the teeth generating network the addition of the regularization term was not needed.
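
In least-squares form, the term (9) simply contributes one extra row per chart to the system of Subsection 3.3; a sketch (with assumed names, extending the earlier snippet):

```python
# Sketch: append the rows encoding Equation (9), pulling each scale s_f
# toward its precomputed dataset-average scale sbar_f with weight lam.
import numpy as np

def add_scale_regularization(rows, rhs, sbar, lam, n_unknowns):
    w = np.sqrt(lam)
    for f, target in enumerate(sbar):
        row = np.zeros(n_unknowns)
        row[4 * f] = w              # sqrt(lam) * s_f ...
        rows.append(row)
        rhs.append(w * target)      # ... = sqrt(lam) * sbar_f
    return rows, rhs
```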

5 Evaluation

Figure 8: Comparison with two variations of our algorithm (rows, top to bottom: no normalization, no projection, ours). In each pair, the left model shows all the individual generated charts and the right model is the final reconstruction.
Figure 9: Comparison of human models generated by our method (left in each triplet) and their nearest neighbors in the training set (middle and right in each triplet). In the middle of each triplet we show the nearest training model reconstructed from its charts using our reconstruction pipeline; on the right we show the original surface mesh from the dataset. The blow-ups emphasize the differences between generated and real face examples.

In this section we compare our method to several baseline methods, all of which are variations of our approach. We also present a nearest neighbor evaluation, for testing the ability of our method to generate novel shapes.

5.1 Single-chart surface generation

A naive adaptation of the approach presented in [25] to surface generation is to train a network that generates a single chart at each feed-forward pass and stitch the generated charts in a postprocess. The output of G in this network is a single chart of dimensions $64 \times 64 \times 3$, and the capacity of the network was reduced accordingly compared to the multi-chart network. We trained this network using the same data we used for our method, feeding a random chart at each iteration. Figure 7 shows a few typical examples generated using this approach. In order to generate the first two models (left and middle) we selected random charts until we had all 16 necessary charts. For the last model we cherry-picked specific charts that seemed to fit reasonably. In all cases we ran our reconstruction algorithm, with the exception of using the mean charts' scales and solving only for the translations (solving for the scales as well produced worse results). This comparison shows that the different charts of the same shape should be learned jointly.

5.2 Chart normalization and landmark consistency

We compared our method to two other baseline methods: (a) learning the multi-chart structure without chart normalization (centering and scaling); (b) learning the multi-chart structure without the landmark consistency layer (described in Section 3.5). Figure 8 compares baselines (a)-(b) to our method by depicting several typical examples. The first row shows baseline (a), the second row baseline (b), and the third row our algorithm (with normalization and landmark consistency). Note that both the normalization step and the landmark consistency layer are important in order to generate smooth and consistent results.

5.3 Nearest neighbor evaluation

In order to test the method's ability to generate unseen shapes, we apply our trained generator to multiple random latent variables $z$ and compare the resulting charts $G(z)$ to their nearest neighbors in the training data, using the $L^2$ norm on the multi-chart tensors. In the experiment, shown in Figure 9, we show: left, the surface reconstructed from the generated example $G(z)$; middle, the closest model in the training set reconstructed from its charts using our reconstruction pipeline; right, the closest model in its original surface form.
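
A sketch of this nearest-neighbor search in tensor space (names assumed):

```python
# Sketch: find the training tensor closest to a generated one under the
# (squared) L2 norm; argmin is the same for L2 and squared L2.
import numpy as np

def nearest_training_index(generated, train):
    """generated: (m, m, c); train: (N, m, m, c) training tensors."""
    d = ((train - generated[None]) ** 2).sum(axis=(1, 2, 3))
    return int(np.argmin(d))
```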

6 Results

Figure 10: Human shape generation. Comparison of our method with a volumetric GAN baseline, the approach of [17], and that of [21] (panels: Volumetric GAN; AtlasNet [17]; Litany et al. [21]; Ours).

6.1 Comparison with alternative approaches

We compare our method with a volumetric GAN approach, the recent approach of [17], and the approach of [21]. Figure 10 shows 8 results generated with each approach.

The volumetric method is implemented according to [36] with a grid resolution comparable (in number of entries) to our tensors. The volumetric generator tends to produce crude, brick-like approximations of the surface shapes, hindering representation of specific body details. The results of [17] were provided by the authors and were trained only on the FAUST dataset (200 models) [5]. Although this is a smaller dataset than the one we used, the differences in the level of detail and surface fidelity are clear. The results of [21] were provided by the authors and are obtained by training on the DFAUST dataset [6]. Note that their variational autoencoder was trained for a different task, shape completion. For this task they have explicitly relaxed the Gaussian prior during training, which (as mentioned by the authors) might give rise to the generation of slightly unrealistic shapes.

Figure 11: Equispaced interpolation between two teeth models.

6.2 Shape interpolation

Our method learns a map $G$ from the latent variable space to shape space. This gives us the ability to perform interpolation between two generated shapes $G(z_1), G(z_2)$ by sampling $G$ along the segment between the latent vectors $z_1$ and $z_2$. Figure 4 shows equispaced samplings of a latent space line segment between two humans in different poses and with different body characteristics. Note how the models change in a continuous manner through other natural models and poses. Figure 11 shows a similar experiment with the teeth surface dataset. The supplementary movie shows interpolation between different humans (and teeth) in the latent space.
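
Concretely, the interpolation amounts to decoding equispaced points on a latent segment; a sketch assuming G accepts a batch of latent vectors:

```python
# Sketch: equispaced latent-space interpolation between two shapes.
import numpy as np

def interpolate_shapes(G, z1, z2, steps=8):
    ts = np.linspace(0.0, 1.0, steps)
    zs = np.stack([(1.0 - t) * z1 + t * z2 for t in ts])
    return G(zs)  # one multi-chart tensor per step, reconstructed as in Sec. 3.6
```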

6.3 Shape exploration

In this experiment, shown in Figures 12 and 13, we computed a 2D grid of latent vectors using bilinear interpolation of four latent vectors and generated the corresponding models. Note how the grid gracefully captures the pose space. These types of grids can be used as a means to browse datasets and shape spaces.
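
The grid is obtained by bilinear interpolation of the four corner latent vectors; a sketch with assumed names:

```python
# Sketch: 2D exploration grid spanned by four corner latent vectors.
import numpy as np

def latent_grid(G, z00, z01, z10, z11, n=5):
    shapes = []
    for u in np.linspace(0.0, 1.0, n):
        for v in np.linspace(0.0, 1.0, n):
            z = ((1 - u) * (1 - v) * z00 + (1 - u) * v * z01 +
                 u * (1 - v) * z10 + u * v * z11)
            shapes.append(G(z[None])[0])
    return shapes
```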

Failure cases: Figure 14 shows the result of an experiment of 100 random models generated by our method, where failures are marked in red. Note that only a small fraction of the models fail, and in general the failures are also rather plausible human shapes.

Figure 12: Bilinear interpolation of four generated human models.
Figure 13: Bilinear interpolation of four generated teeth models.
Figure 14: 100 random human shape generation with our method. The failures are shown in red.

6.4 Massive-scale data generation.

Lastly, our method can be used for massive generation of plausible random models. Figure 15 shows 10,000 human models generated by our method, completely automatically. Note the diverse poses and different faces our method is able to generate without human intervention.

Figure 15: Massive data generation: 10,000 random human models.

7 Conclusions

In this paper we present a new method for generating random shapes based on a novel 3D shape representation called multi-chart structure.

The main limitation of our approach is the fact that it is restricted to genus-zero (i.e., sphere-type) surfaces. It would be interesting future work to generalize the method to arbitrary shape topologies, triangle soups and even point clouds. Although we opted for conformal mappings, we believe that other parameterization methods (e.g., area-preserving maps, which are used in geometric deep learning [28]) can greatly benefit from our multi-chart representation as well. Furthermore, our representation could be used with other deep generative models such as variational autoencoders (VAEs).

Currently the reconstruction of the final mesh from the generated charts is done using a fixed template. Interesting future work would be to devise more generic ways to reconstruct the final surface mesh from the charts, maybe even incorporating this task into the network. Lastly, we would like to generalize our work to conditional generative models, which will allow additional user control of the generated shapes.

8 Acknowledgements

This research was supported in part by the European Research Council (ERC Consolidator Grant, "LiftMatch" 771136) and the Israel Science Foundation (Grant No. 1830/17). We would like to thank the authors of AtlasNet [17] and of [21] for sharing their results for comparison.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [2] B. Allen, B. Curless, and Z. Popović. The space of human body shapes: reconstruction and parameterization from range scans. In ACM transactions on graphics (TOG), volume 22, pages 587–594. ACM, 2003.
  • [3] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape: shape completion and animation of people. In ACM Transactions on Graphics (TOG), volume 24, pages 408–416. ACM, 2005.
  • [4] anonymous. Multi-chart generative surface modeling. arXiv preprint arXiv:1806, 2018.
  • [5] F. Bogo, J. Romero, M. Loper, and M. J. Black. Faust: Dataset and evaluation for 3d mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3794–3801, 2014.
  • [6] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dynamic faust: Registering human bodies in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [7] D. M. Boyer, Y. Lipman, E. S. Clair, J. Puente, B. A. Patel, T. Funkhouser, J. Jernvall, and I. Daubechies. Algorithms to automatically quantify the geometric similarity of anatomical surfaces. Proceedings of the National Academy of Sciences, 108(45):18221–18226, 2011.
  • [8] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [9] R. Caron and T. Traynor. The zero set of a polynomial. WSMR Report, pages 05–02, 2005.
  • [10] A. Dai, C. R. Qi, and M. Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. arXiv preprint arXiv:1612.00101, 2016.
  • [11] C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
  • [12] H. Fan, H. Su, and L. Guibas. A point set generation network for 3d object reconstruction from a single image. arXiv preprint arXiv:1612.00603, 2016.
  • [13] T. Funkhouser, M. Kazhdan, P. Shilane, P. Min, W. Kiefer, A. Tal, S. Rusinkiewicz, and D. Dobkin. Modeling by example. In ACM Transactions on Graphics (TOG), volume 23, pages 652–663. ACM, 2004.
  • [14] M. Gadelha, S. Maji, and R. Wang. 3d shape induction from 2d views of multiple objects. arXiv preprint arXiv:1612.05872, 2016.
  • [15] I. Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [17] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. Atlasnet: A papier-mâché approach to learning 3d surface generation. CVPR, 2018.
  • [18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
  • [19] E. Kalogerakis, S. Chaudhuri, D. Koller, and V. Koltun. A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics (TOG), 31(4):55, 2012.
  • [20] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [21] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. arXiv preprint arXiv:1712.00268, 2017.
  • [22] J. Liu, F. Yu, and T. Funkhouser. Interactive 3d modeling with a generative adversarial network. arXiv preprint arXiv:1706.05170, 2017.
  • [23] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
  • [24] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang. 3d shape reconstruction from sketches via multi-view convolutional networks. arXiv preprint arXiv:1707.06375, 2017.
  • [25] H. Maron, M. Galun, N. Aigerman, M. Trope, N. Dym, E. Yumer, V. G. Kim, and Y. Lipman. Convolutional neural networks on surfaces via seamless toric covers. SIGGRAPH, 2017.
  • [26] C. Nash and C. K. Williams. The shape variational autoencoder: A deep generative model of part-segmented 3d objects. In Computer Graphics Forum, volume 36, pages 1–12. Wiley Online Library, 2017.
  • [27] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
  • [28] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. Surfnet: Generating 3d shape surfaces using deep residual networks. arXiv preprint arXiv:1703.04079, 2017.
  • [29] A. A. Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1511–1519, 2017.
  • [30] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016.
  • [31] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. arXiv preprint arXiv:1703.09438, 2017.
  • [32] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
  • [33] W. Wang, Q. Huang, S. You, C. Yang, and U. Neumann. Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. arXiv preprint arXiv:1711.06375, 2017.
  • [34] W. Whiteley. Some matroids from discrete applied geometry. Contemporary Mathematics, 197:171–312, 1996.
  • [35] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. Marrnet: 3d shape reconstruction via 2.5D sketches. In Advances in Neural Information Processing Systems, pages 540–550, 2017.
  • [36] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
  • [37] Y. Yang, Y. Yu, Y. Zhou, S. Du, J. Davis, and R. Yang. Semantic parametric reshaping of human body models. In 3D Vision (3DV), 2014 2nd International Conference on, volume 2, pages 41–48. IEEE, 2014.
  • [38] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3394–3404, 2017.
  • [39] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. arXiv preprint arXiv:1707.04682, 2017.

Appendix A Proofs

To prove Theorem 2 we will prove a more general result dealing with scale-translation rigidity of graphs with respect to per-edge scale and translation. That is, we consider graphs where each edge can only be scaled and/or translated, but not rotated.

Theorem 3.

Every generic embedding of a 2-connected graph with chordless cycles of length at most 4 is unique up to global scale and translation.

This result directly applies to triangulations, which are also graphs, however with fewer degrees of freedom, as only scaling and translation of a whole triangle is allowed.

The general idea of the proof is to first show the theorem for short chordless cycles (Lemma 1) and then use it as a building block for proving s-t rigidity of more general graphs (Theorem 2 and Theorem 3).

Lemma 1.

Every generic embedding of a chordless cycle of length at most 4 is unique up to global scale and translation.

Proof of lemma 1.

Consider a generic embedding of a cycle of length $n \leq 4$. Denote the embeddings of the vertices of the chordless cycle by $v_1, \ldots, v_n \in \mathbb{R}^3$, and the set of vectors connecting neighboring vertices by $d_i = v_{i+1} - v_i$, $i = 1, \ldots, n$ (indices modulo $n$). The set $\{d_i\}$ satisfies:

$$\sum_{i=1}^{n} d_i = 0, \qquad (10)$$

or in matrix form, where the $d_i$ are the columns of $D \in \mathbb{R}^{3 \times n}$:

$$D \mathbf{1} = 0. \qquad (11)$$

Since the embedding is generic,

$$\dim \mathrm{aff}\{v_1, \ldots, v_n\} = n - 1, \qquad (12)$$

where $\mathrm{aff}$ denotes the affine hull. Therefore the column rank of $D$ is $n - 1$ and $\ker D = \mathrm{span}\{\mathbf{1}\}$.

Now, assume a different embedding $\tilde{v}_1, \ldots, \tilde{v}_n$ such that one edge is fixed, that is, w.l.o.g. $\tilde{v}_1 = v_1$, $\tilde{v}_2 = v_2$ (i.e., $d_1$ is fixed). In particular $\tilde{d}_1 = d_1$. Since $\tilde{v}$ is an embedding of the same cycle, all vectors $\tilde{d}_i$ are by assumption scaled versions, $\tilde{d}_i = a_i d_i$, where $a_i \neq 0$ and $a_1 = 1$. Furthermore, $a = (a_1, \ldots, a_n)^T$ satisfies $D a = 0$. Since $\ker D = \mathrm{span}\{\mathbf{1}\}$ we get that $a \in \mathrm{span}\{\mathbf{1}\}$, that is, $a_i = a_1 = 1$ for all $i$. Since $\tilde{v}_1 = v_1$ we consequently get that $\tilde{v}_i = v_i$

for all $i$. We showed there could be only one such embedding and therefore the lemma is proved. ∎

Lemma 2.

Let $G$ be a graph and $G'$ a subgraph of $G$. If there exists a simple cycle in $G$ containing an edge from $G'$ and a vertex from $G \setminus G'$, then there exists a chordless cycle containing an edge from $G'$ and a vertex from $G \setminus G'$.

Proof of lemma 2.

If the cycle is chordless we are done. If not, we show that a shorter cycle with the same properties can be found: in this case, there exists an edge of $G$ between two non-consecutive vertices of the cycle (a chord). By adding this edge we split the original cycle into two shorter cycles, both containing the chord. If both endpoints of the chord are from $G'$, keep the cycle that also contains the vertex from $G \setminus G'$. Otherwise, keep the cycle containing the edge from $G'$. In both cases it is guaranteed that the newly chosen cycle is shorter and contains an edge from $G'$ and a vertex from $G \setminus G'$. Repeating this process, in a finite number of steps a chordless cycle satisfying the conditions will be obtained. ∎

Proof of Theorem 3.

Let $G$ denote a 2-connected graph with chordless cycles of length at most 4, and let $v$ be a generic embedding of $G$. We will show that $v$ is unique up to global scale and translation.

We define an iterative process that grows an s-t rigid subgraph.

Let $S$ be the subgraph induced by a set of vertices $V_S$. First, set $V_S = \{u, w\}$, where $u, w$ are two adjacent vertices, i.e., $(u, w) \in E$. While there is a chordless cycle $c$ that contains a vertex not in $S$ and an edge in $S$, add it to $S$.

To finish the proof we need to prove: (i) at every iteration of the algorithm $S$ is s-t rigid; and (ii) when the algorithm terminates, $S = G$.

We start with (i): First, when $S$ is a single edge, it is s-t rigid by definition. Now, given an s-t rigid $S$, we need to prove that $S \cup c$ is s-t rigid, where $c$ is a chordless cycle as described above. Since all chordless cycles in $G$ are of length at most 4, by Lemma 1 $c$ is s-t rigid. By assumption $S$ is s-t rigid, and since $S$ and $c$ share an edge, their union is s-t rigid.

Next, we prove (ii). Assume towards a contradiction that $S \neq G$. Since $G$ is connected there exists an edge $e = (a, b)$ with one endpoint $a \in S$ and the other $b \notin S$. Furthermore, since $S$ is connected there exists an edge $e' = (a, a') \in S$ (see the inset (a)).

Using the 2-connectedness of $G$, we can exclude the vertex $a$ to obtain a new connected graph $G \setminus \{a\}$. Since $G \setminus \{a\}$ is connected, there exists a path between $b$ and $a'$ which does not include $a$ (inset (b)). Taking this path and completing it with $e$ and $e'$ we get a simple cycle containing an edge from $S$ and a vertex from $G \setminus S$ (inset (c)). Using Lemma 2 there exists a chordless cycle with an edge in $S$ and a vertex in $G \setminus S$, in contradiction to the fact that the algorithm terminated. ∎

Proof of Proposition 1.

Assume by way of contradiction that there exist two embeddings that are not related by a global scale and translation, and denote by $x_1 \neq x_2$ the corresponding vectors of scales, translations and vertex positions for all the triangles. W.l.o.g. we can assume that both satisfy Equation (4), by proper scaling and translating. Furthermore, both satisfy Equation (3) as well. This implies that $x_1 - x_2 \neq 0$ is in the kernel of the matrix of Equations (3)-(4), which means it does not have full column rank.

Conversely, assume by way of contradiction that the linear system (3)-(4) does not have full column rank. This implies that there exist two different solutions to the system that agree on the first triangle (Equation (4)). This is a contradiction to the assumption that the triangulation has a unique embedding up to global scale and translation. ∎

Proof of Theorem 1.

Indeed, let $A$ be the matrix of the linear system (3)-(4), and consider the polynomial $\det(A^T A)$. Since each $\tilde{q}_i^f$ is a centered-scaled version of $q_i$ it can be written as $\tilde{q}_i^f = a_f (q_i - t_f)$ for some $a_f \in \mathbb{R}$, $t_f \in \mathbb{R}^3$. Therefore, $\det(A^T A)$ is a polynomial in $Q$ and $(a, t)$ and can be written as $\sum_j m_j(a, t)\, p_j(Q)$, where the $m_j$ are monomials and the $p_j$ polynomials. If all polynomials $p_j$ are the zero polynomial, then $\det(A^T A)$ is the zero polynomial and the multi-chart structure is not s-t rigid for any $Q$. Otherwise, at least one $p_j$ is not the zero polynomial. Using the fact that a non-zero polynomial is non-zero almost everywhere [9] we get that for almost every $Q$ some $p_j(Q) \neq 0$. Fixing such a $Q$, we have a non-zero polynomial in $(a, t)$ and therefore $\det(A^T A) \neq 0$ for almost all $(a, t)$. ∎

Appendix B Architecture details

GENERATOR
layer input output
FC 128 4x4x1536
periodic conv 3x3 4x4x1536 4x4x1536
Relu
upsample 4x4x1536 8x8x1536
periodic conv 3x3 8x8x1536 8x8x768
Relu
periodic conv 3x3 8x8x768 8x8x768
Relu
upsample 8x8x768 16x16x768
periodic conv 3x3 16x16x768 16x16x384
Relu
periodic conv 3x3 16x16x384 16x16x384
Relu
upsample 16x16x384 32x32x384
periodic conv 3x3 32x32x384 32x32x192
Relu
periodic conv 3x3 32x32x192 32x32x192
Relu
upsample 32x32x192 64x64x192
periodic conv 3x3 64x64x192 64x64x96
Relu
periodic conv 3x3 64x64x96 64x64x96
Relu
periodic conv 1x1 64x64x96 64x64x48
symmetry projection layer 64x64x48 64x64x48
landmark consistency 64x64x48 64x64x48
zero mean 64x64x48 64x64x48
DISCRIMINATOR
periodic conv 1x1 64x64x48 64x64x96
LeRelu
periodic conv 3x3 64x64x96 64x64x96
LeRelu
periodic conv 3x3 64x64x96 64x64x192
LeRelu
downsample 64x64x192 32x32x192
periodic conv 3x3 32x32x192 32x32x192
LeRelu
periodic conv 3x3 32x32x192 32x32x384
LeRelu
downsample 32x32x384 16x16x384
periodic conv 3x3 16x16x384 16x16x384
LeRelu
periodic conv 3x3 16x16x384 16x16x768
LeRelu
downsample 16x16x768 8x8x768
periodic conv 3x3 8x8x768 8x8x768
LeRelu
periodic conv 3x3 8x8x768 8x8x1536
LeRelu
downsample 8x8x1536 4x4x1536
periodic conv 3x3 4x4x1536 4x4x1536
LeRelu
periodic conv 4x4 4x4x1536 1x1x1536
LeRelu
FC 1x1536 1
Table 1: Architecture details - humans generating network
GENERATOR
layer input output
FC 32 4x4x256
periodic conv 3x3 4x4x256 4x4x256
Relu
upsample 4x4x256 8x8x256
periodic conv 3x3 8x8x256 8x8x128
Relu
periodic conv 3x3 8x8x128 8x8x128
Relu
upsample 8x8x128 16x16x128
periodic conv 3x3 16x16x128 16x16x64
Relu
periodic conv 3x3 16x16x64 16x16x64
Relu
upsample 16x16x64 32x32x64
periodic conv 3x3 32x32x64 32x32x32
Relu
periodic conv 3x3 32x32x32 32x32x32
Relu
upsample 32x32x32 64x64x32
periodic conv 3x3 64x64x32 64x64x16
Relu
periodic conv 3x3 64x64x16 64x64x16
Relu
periodic conv 1x1 64x64x16 64x64x12
symmetry projection layer 64x64x12 64x64x12
landmark consistency 64x64x12 64x64x12
zero mean 64x64x12 64x64x12
DISCRIMINATOR
periodic conv 1x1 64x64x12 64x64x16
LeRelu
periodic conv 3x3 64x64x16 64x64x16
LeRelu
periodic conv 3x3 64x64x16 64x64x32
LeRelu
downsample 64x64x32 32x32x32
periodic conv 3x3 32x32x32 32x32x32
LeRelu
periodic conv 3x3 32x32x32 32x32x64
LeRelu
downsample 32x32x64 16x16x64
periodic conv 3x3 16x16x64 16x16x64
LeRelu
periodic conv 3x3 16x16x64 16x16x128
LeRelu
downsample 16x16x128 8x8x128
periodic conv 3x3 8x8x128 8x8x128
LeRelu
periodic conv 3x3 8x8x128 8x8x256
LeRelu
downsample 8x8x256 4x4x256
periodic conv 3x3 4x4x256 4x4x256
LeRelu
periodic conv 4x4 4x4x256 1x1x256
LeRelu
FC 1x256 1
Table 2: Architecture details - teeth generating network