NASA: Neural Articulated Shape Approximation
Abstract
Efficient representation of articulated objects such as human bodies is an important problem in computer vision and graphics. To efficiently simulate deformation, existing approaches represent objects as meshes and deform them using skinning techniques. This paper introduces neural articulated shape approximation (NASA), a framework that enables efficient representation of articulated deformable objects using neural indicator functions parameterized by pose. In contrast to classic approaches, NASA avoids the need to convert between different representations. For occupancy testing, NASA circumvents the complexity of meshes and mitigates the issue of watertightness. In comparison with regular grids and octrees, our approach provides high resolution without high memory use.
1 Introduction
There has been a surge of recent interest in computer vision in developing better and more flexible 3D representations of objects and scenes [28, 11, 29, 6]. These recent advances are partly motivated by the development of “inverse graphics” pipelines for scene understanding [34]. With the dominance of deep neural networks in computer vision, we have seen inverse graphics flourish, especially when differentiable models of geometry are available. However, among possible applications, neural models of articulated objects have received little attention. Models of articulated objects are particularly important because they encompass 3D representations of humans. Virtual humans are a central subject not only in computer games and animated movies, but also in other applications such as augmented and virtual reality.
Existing geometric learning algorithms include self-supervised methods for face [32], body [18], and low-level geometry [8], all relying on optimization of fully differentiable encoder-decoder architectures. The use of neural decoders is also a possibility [14], but the quality of results can receive a significant boost when more structure about the phenomena being modeled is directly expressed within the architecture; see [33] for an example. Geometric models often must fulfill several purposes, such as representing the shape for rendering or representing the volume for the purpose of intersection queries. Although neural models have been used in the context of articulated deformation [2], they have addressed only deformations while relegating both intersection queries and the overall articulation to classic methods, thus sacrificing full differentiability.
Our method represents articulated objects with a differentiable neural model. We train a neural decoder that exploits the structure of the underlying deformation driving the articulated object. As with some previous geometric learning efforts [28, 8, 29, 4], we represent geometry by indicator functions (also referred to as occupancy functions) that evaluate to 1 inside the object and 0 otherwise. If desired, an explicit surface can be extracted via marching cubes [24]. Unlike previous approaches, which focused on collections of static objects described by (unknown) shape parameters, we look at learning indicator functions as we vary pose parameters, which will be discovered by training on animation sequences. Overall, our contributions are:

We propose a way to approximate articulated deformable models via neural networks – the core idea is to model shapes by networks that encode a [quasi] piecewise rigid decomposition;

We show how explicitly expressing structure of deformation in the network allows for fewer model parameters while providing both similar performance and better generalization;

The indicator function representation supports efficient intersection and collision queries, avoiding the need to convert to a separate representation for this purpose.
2 Related works
Neural shape approximation provides a single framework that addresses problems that have previously been approached separately. The related literature thus includes a number of works across several different fields.
Skinning algorithms. Efficient articulated deformation is traditionally accomplished with a skinning algorithm that deforms vertices of a mesh surface as the joints of an underlying abstract skeleton change. The classic linear blend skinning (LBS) algorithm expresses the deformed vertex as a weighted sum of that vertex rigidly transformed by several adjacent bones; see [16] for details. LBS is widely used in computer games, and is a core ingredient of popular vision models [23]. Mesh sequences of general (not necessarily articulated) deforming objects have also been represented with skinning for the purposes of compression and manipulation, using a collection of non-hierarchical “bones” (transformations) discovered with clustering [17, 20]. LBS has well-known disadvantages: the deformation has a simple algorithmic form that cannot produce pose-dependent detail, it results in characteristic volume-loss effects such as the “collapsing elbow” and “candy wrapper” artifacts [21, Figs. 2, 3], and for best results the weights must be manually painted by artists. It is possible to add pose-dependent detail with a deep net regression [2], but this process operates as a correction to classical LBS deformation.
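The LBS rule described above (each deformed vertex is a weighted sum of that vertex rigidly transformed by several bones) can be illustrated with a minimal NumPy sketch. The function name `linear_blend_skinning`, the array layouts, and the use of 4×4 homogeneous transforms mapping rest pose to posed space are assumptions made for illustration, not the implementation of any particular system.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, weights, bone_transforms):
    """Deform rest-pose vertices with linear blend skinning.

    rest_vertices:   (V, 3) rest-pose positions.
    weights:         (V, B) skinning weights; each row sums to 1.
    bone_transforms: (B, 4, 4) homogeneous transforms (rest -> posed), assumed.
    """
    num_vertices = rest_vertices.shape[0]
    homo = np.concatenate([rest_vertices, np.ones((num_vertices, 1))], axis=1)
    # Every vertex rigidly transformed by every bone: shape (B, V, 4).
    per_bone = np.einsum("bij,vj->bvi", bone_transforms, homo)
    # Weighted sum over bones: shape (V, 4).
    blended = np.einsum("vb,bvi->vi", weights, per_bone)
    return blended[:, :3]

# Two bones: one stays fixed, one translates by 2 along x.
identity = np.eye(4)
shift = np.eye(4)
shift[:3, 3] = [2.0, 0.0, 0.0]

rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
w = np.array([[1.0, 0.0],   # fully attached to the fixed bone
              [0.5, 0.5]])  # blended half and half
posed = linear_blend_skinning(rest, w, np.stack([identity, shift]))
print(posed)
```

The second vertex lands halfway between its two rigidly transformed copies, which is exactly the linear blending that produces the volume-loss artifacts mentioned above when the bones rotate rather than translate.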
Object intersection queries. Registration, template matching, 3D tracking, collision detection, and other tasks require efficient inside/outside tests. A disadvantage of polygonal meshes is that they do not efficiently support these queries, as meshes often contain thousands of individual triangles that must be tested for each query. This has led to the development of a variety of spatial data structures to accelerate point-object queries [22, 31], including voxel grids, octrees, and others. In the case of deforming objects, the spatial data structure must be repeatedly rebuilt as the object deforms. A further problem is that typical meshes may be constructed without regard to being “watertight” and thus do not have a clearly defined interior [15].
Part-based representations. For object intersection queries on articulated objects, it can be more efficient to approximate the overall shape in terms of a moving collection of rigid parts, such as spheres or ellipsoids, that support an efficient intersection test [30]. Unfortunately, this has the drawback of introducing a second approximate representation that does not exactly match the originally desired deformation. A further core challenge, and subject of continuing research, is the automatic creation of this part-based representation [1, 7, 12]. Unsupervised part discovery has recently been tackled by a number of deep learning approaches [8, 25, 5, 9, 10]. In general these methods address analysis and correspondence across shape collections, and do not target accurate representations of articulated deforming objects. Pose-dependent deformation effects are also not considered in any of these approaches.
Neural implicit object representation. Finally, several recent works represent objects with neural implicit functions [28, 4, 29]. These works focus on the neural representation of static shapes in an aligned canonical frame and do not target the modeling of transformations. Our work can be considered an extension of these methods, where the core difference is its ability to efficiently represent complex and detailed articulated objects (e.g. human bodies).
3 Neural Articulated Shape Approximation
Figure 2 illustrates the problem of articulated shape approximation in 2D. We are provided with an articulated object in the rest pose (the typical T-pose) and the corresponding occupancy function $\bar{\mathcal{O}}(x)$. In addition, we are provided with a collection of $T$ ground-truth occupancies $\{\mathcal{O}(x \mid \theta_t)\}_{t=1}^{T}$ associated with $T$ poses. In our formulation, each pose parameter $\theta$ represents a set of $B$ posed transformations associated with $B$ bones, i.e., $\theta \equiv \{\mathbf{B}_b\}_{b=1}^{B}$. To help disambiguate the part-whole relationship, we also assume that for each mesh vertex $v$, the skinning weights $\mathbf{w}(v) \in \mathbb{R}^{B}$ are available, where $\sum_b w_b(v) = 1$ with $w_b(v) \geq 0$.
Given the collection of pose parameters $\{\theta_t\}_{t=1}^{T}$, we desire to query the corresponding indicator function $\mathcal{O}(x \mid \theta_t)$ at a point $x$. This task is more complicated than it might seem, as in the general setting this operation requires the computation of generalized winding numbers [15]. However, when given a database of poses $\{\theta_t\}$ and corresponding ground truth indicators $\{\mathcal{O}(x \mid \theta_t)\}$, we can formulate our problem as the minimization of the following objective:
$$\mathcal{L}_{\mathcal{O}}(\omega) = \sum_{t=1}^{T} \mathbb{E}_{x \sim p(x)} \left[ \left( \mathcal{O}_\omega(x \mid \theta_t) - \mathcal{O}(x \mid \theta_t) \right)^2 \right] \qquad (1)$$
where $p(x)$ is a density representing the sampling distribution of query points $x$ (Section 4.4) and $\mathcal{O}_\omega$ is a neural network with parameters $\omega$ that represents our neural shape approximator. We adopt a sampling distribution that randomly samples in the volume surrounding a posed character, along with additional samples in the vicinity of the deformed surface.
One can view the neural shape approximator $\mathcal{O}_\omega$ as a binary classifier that aims to separate the interior of the shape from its exterior. Accordingly, one can use a binary cross-entropy loss for optimization, but our preliminary experiments suggest that both L2 and cross-entropy losses perform similarly for shape approximation. Thus, we adopt (1) in our experiments.
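To make the two candidate losses concrete, the following is a small sketch that evaluates an L2 loss and a binary cross-entropy loss on a handful of predicted occupancies. The function names, the sample values, and the clipping constant `eps` are illustrative assumptions; the expectation over sampled points is replaced here by a plain mean over a fixed batch.

```python
import numpy as np

def l2_occupancy_loss(pred, target):
    """Mean squared difference between predicted and true occupancy."""
    return float(np.mean((pred - target) ** 2))

def bce_occupancy_loss(pred, target, eps=1e-7):
    """Binary cross-entropy, treating occupancy as a classification target."""
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

# Ground-truth occupancy at four sampled points (1 = inside, 0 = outside).
target = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.9, 0.8, 0.1, 0.2])  # a confident, mostly correct prediction
bad = np.array([0.4, 0.6, 0.7, 0.5])   # a poor prediction

for name, pred in [("good", good), ("bad", bad)]:
    print(name, l2_occupancy_loss(pred, target), bce_occupancy_loss(pred, target))
```

Both losses rank the two predictions the same way, which is consistent with the observation that either can serve as the training objective.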
4 Neural Architectures for NASA
We investigate several neural architectures for the problem of articulated shape approximation. The unstructured architecture in Section 4.1 does not explicitly encode the knowledge of articulated deformation. However, typical articulated deformation models [23] express deformed mesh vertices $v$ reusing the information stored in rest vertices $\bar{v}$. Hence, we can assume that computing the function $\mathcal{O}(x \mid \theta)$ in the deformed pose can be done by reasoning about the information stored at rest pose $\bar{\mathcal{O}}(x)$. Taking inspiration from this observation, we investigate two different architecture variants, one that models geometry via a piecewise-rigid assumption (Section 4.2), and one that relaxes this assumption and employs a quasi-rigid decomposition, where the shape of each element can deform according to the pose (Section 4.3).
4.1 Unstructured model – “U”
Recently, a series of papers [4, 29, 28] tackled the problem of modeling occupancy across shape datasets as $\mathcal{O}(x \mid \beta)$, where $\beta$ is a latent code learned to encode the shape. These techniques employ deep and fully connected networks, which one can adapt to our setting by replacing the shape code $\beta$ with pose parameters $\theta$, and using a neural network that takes as input $[x, \theta]$. ReLU activations are used for inner layers of the neural net and a sigmoid activation is used for the final output so that the occupancy prediction is bounded between 0 and 1.
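To provide pose information to the network, one can simply concatenate the set of affine bone transformations to the query point to obtain $[x, \{\mathbf{B}_b\}]$ as the input. This results in an input tensor of size $3 + 16 \times B$. Instead, we propose to represent the composition of a query point $x$ with a pose $\theta$ via $\{\mathbf{B}_b^{-1} x\}$, resulting in a smaller input of size $3 \times B$. Our unstructured baseline takes the form:
$$\mathcal{O}_\omega(x \mid \theta) = \text{MLP}_\omega\left( \{\mathbf{B}_b^{-1} x\}_{b=1}^{B} \right) \qquad (2)$$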
We term this the unstructured model as it does not explicitly model the underlying deformation process.
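As an illustration of how a query point might be composed with a pose, the sketch below expresses the point in the local frame of every bone and concatenates the results into one feature vector. It assumes 3D points, 4×4 homogeneous bone transforms mapping rest pose to posed space, and that the composition uses their inverses; the function name `pose_conditioned_input` is hypothetical.

```python
import numpy as np

def pose_conditioned_input(x, bone_transforms):
    """Express a 3D query point in every bone's local frame and concatenate.

    x:               (3,) query point in posed space.
    bone_transforms: (B, 4, 4) homogeneous transforms (rest -> posed), assumed.
    Returns a flat feature vector of length 3 * B.
    """
    x_h = np.append(x, 1.0)               # homogeneous coordinates
    inv = np.linalg.inv(bone_transforms)  # batched inverse, shape (B, 4, 4)
    local = inv @ x_h                     # shape (B, 4)
    return local[:, :3].reshape(-1)

identity = np.eye(4)
shift = np.eye(4)
shift[:3, 3] = [1.0, 2.0, 3.0]
bones = np.stack([identity, shift])

features = pose_conditioned_input(np.array([1.0, 2.0, 3.0]), bones)
naive_size = 3 + 16 * bones.shape[0]  # point plus all raw matrix entries
print(features, features.size, naive_size)
```

For two bones the feature vector has 6 entries, compared with 35 when the raw matrices are concatenated to the point. The gap widens linearly with the number of bones.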
4.2 Piecewise rigid model – “R”
The simplest structured deformation model for articulated objects assumes our object can be represented via a piecewise rigid composition of elements; e.g. [30, 27]:
$$\mathcal{O}(x \mid \theta) = \max_b \left\{ \mathcal{O}^b(x \mid \theta) \right\} \qquad (3)$$
We observe that if these elements are related to corresponding rest-pose elements through the rigid transformations $\{\mathbf{B}_b\}$, then it is possible to query the corresponding rest-pose indicator as:
$$\mathcal{O}_\omega(x \mid \theta) = \max_b \left\{ \bar{\mathcal{O}}^b_\omega(\mathbf{B}_b^{-1} x) \right\} \qquad (4)$$
where, similarly to (2), we can represent each of the $B$ components via a learnable indicator $\bar{\mathcal{O}}^b_\omega(\cdot) = \text{MLP}^b_\omega(\cdot)$. This formulation assumes that the local shape of each learned bone component stays constant across the range of poses when viewed from the corresponding coordinate frame, which is only a crude approximation of the deformation in realistic characters and other deformable shapes.
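A minimal 2D sketch of the piecewise rigid idea follows: each query point is mapped into the rest frame of every bone, a per-part indicator is evaluated there, and the union is taken with a max. Analytic disks stand in for the learnable part indicators, and the 3×3 homogeneous transforms (rest to posed) are an assumed convention; `make_disk_indicator` and `rigid_occupancy` are hypothetical names.

```python
import numpy as np

def make_disk_indicator(center, radius):
    """Stand-in for a learned rest-pose part indicator (hypothetical)."""
    center = np.asarray(center, dtype=float)

    def indicator(points):
        inside = np.linalg.norm(points - center, axis=-1) <= radius
        return inside.astype(float)

    return indicator

def rigid_occupancy(points, bone_transforms, part_indicators):
    """Union of rigid parts: evaluate each part in its own rest frame, take max."""
    n = points.shape[0]
    homo = np.concatenate([points, np.ones((n, 1))], axis=1)  # (N, 3) in 2D
    values = []
    for transform, part in zip(bone_transforms, part_indicators):
        local = (np.linalg.inv(transform) @ homo.T).T[:, :-1]  # posed -> rest
        values.append(part(local))
    return np.max(np.stack(values), axis=0)

# Two unit disks at rest; the second bone is translated up by 5 in the pose.
parts = [make_disk_indicator([0.0, 0.0], 1.0), make_disk_indicator([3.0, 0.0], 1.0)]
moved = np.eye(3)
moved[:2, 2] = [0.0, 5.0]
bones = [np.eye(3), moved]

queries = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 5.0]])
occ = rigid_occupancy(queries, bones, parts)
print(occ)
```

The point (3, 0) was inside the second disk at rest but is empty once that bone has moved, while (3, 5) is now occupied. No part changes shape, which is the limitation the text identifies for this model.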
4.3 Piecewise deformable model – “D”
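We can generalize our models by combining the model of (2) with the one in (4), hence allowing each of the elements to be adjusted in shape conditional on the pose of the model:
$$\mathcal{O}_\omega(x \mid \theta) = \max_b \left\{ \bar{\mathcal{O}}^b_\omega( \underbrace{\mathbf{B}_b^{-1} x}_{\text{query}} \mid \underbrace{\theta}_{\text{pose}} ) \right\} \qquad (5)$$
Similarly to (4), we use a collection of learnable indicator functions in rest pose $\{\bar{\mathcal{O}}^b_\omega\}$, and to encode pose conditionals we take inspiration from (2). More specifically, we express our model as:
$$\mathcal{O}_\omega(x \mid \theta) = \max_b \left\{ \bar{\mathcal{O}}^b_\omega\left( \mathbf{B}_b^{-1} x, \{\mathbf{B}_{b'}^{-1} t_0\}_{b'=1}^{B} \right) \right\} \qquad (6)$$
Here $t_0$ is the translation vector of the root bone in homogeneous coordinates, and the pose of the model is represented as $\{\mathbf{B}_b^{-1} t_0\}$. Similarly to (4), we model this function via dense layers. While the input dimensionality of this network is similar to the dimensionality in (2), we will see that the necessary network capacity to achieve comparable approximation performance, especially in extrapolation settings, is much lower.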
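The sketch below shows one way the inputs of a pose-conditioned part network could be assembled: the query point in the bone's local frame, concatenated with a pose code formed by expressing the root translation in every bone frame. The pose code layout, the use of inverse 4×4 transforms, the two-layer `tiny_mlp` with random (untrained) weights, and all names are assumptions for illustration; the output is therefore meaningful only as a shape and range check.

```python
import numpy as np

def tiny_mlp(params, inputs):
    """Two dense layers with ReLU, then a sigmoid so the output lies in (0, 1)."""
    (w1, b1), (w2, b2) = params
    hidden = np.maximum(inputs @ w1 + b1, 0.0)
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits[..., 0]))

def deformable_occupancy(x, bone_transforms, root_translation, part_params):
    """Max over parts, each conditioned on the local query and a shared pose code."""
    inv = np.linalg.inv(bone_transforms)                # (B, 4, 4)
    x_local = (inv @ np.append(x, 1.0))[:, :3]          # query in each bone frame
    pose_code = (inv @ np.append(root_translation, 1.0))[:, :3].reshape(-1)
    parts = [
        tiny_mlp(params, np.concatenate([x_local[b], pose_code]))
        for b, params in enumerate(part_params)
    ]
    return max(parts)

rng = np.random.default_rng(0)
num_bones, hidden_units = 2, 8
input_dim = 3 + 3 * num_bones  # local query plus pose code

def init_params():
    # Random weights stand in for trained parameters (hypothetical).
    return [
        (0.1 * rng.normal(size=(input_dim, hidden_units)), np.zeros(hidden_units)),
        (0.1 * rng.normal(size=(hidden_units, 1)), np.zeros(1)),
    ]

shift = np.eye(4)
shift[:3, 3] = [0.0, 1.0, 0.0]
bones = np.stack([np.eye(4), shift])
part_params = [init_params() for _ in range(num_bones)]

occ = deformable_occupancy(np.array([0.2, 0.3, 0.1]), bones, np.zeros(3), part_params)
print(float(occ))
```

Each part network sees the same pose code but a different local query, so a part can change shape with the pose while still reasoning in its own rest frame.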
4.4 Technical details
We now detail the auxiliary losses we employ to facilitate learning, and the architecture of the network backbones.
Auxiliary loss – skinning weights
As most deformable models are equipped with skinning weights, we exploit this information to facilitate learning of the part-based models (i.e. “R” and “D”). In particular, we label each mesh vertex $v$ with the index of the corresponding highest skinning weight value $b^*(v) = \arg\max_b w_b(v)$, and use the loss:
$$\mathcal{L}_{\text{weights}}(\omega) = \sum_{\theta} \sum_{v} \sum_{b} \left( \mathcal{O}^b_\omega(v \mid \theta) - \mathcal{I}_b(v) \right)^2 \qquad (7)$$
where $\mathcal{O}^b_\omega(v \mid \theta)$ denotes the $b$-th part indicator evaluated at the posed vertex, $\mathcal{I}_b(v) = 1/2$ when $b = b^*(v)$, and $\mathcal{I}_b(v) = 0$ otherwise; by convention, the ½ level set of the indicator is the surface our occupancy represents. In the supplementary material, we conduct an ablation study on the effectiveness of $\mathcal{L}_{\text{weights}}$, showing that this loss is necessary for effective shape decomposition. Without such a loss, a single (deformable) part could end up being used to describe the entire deformable model, and the trivial solution (zero) would be returned for all other parts.
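The labelling step can be sketched as follows: each vertex is assigned to the bone with its largest skinning weight, the target for that part is set to ½ (the surface level set), and all other parts are pushed towards zero at that vertex. The function name, the array layout, and the use of a mean rather than a sum are assumptions for illustration.

```python
import numpy as np

def skinning_weight_loss(part_predictions, skinning_weights):
    """Squared error between per-part indicators at vertices and their labels.

    part_predictions: (V, B) value of each part indicator at each mesh vertex.
    skinning_weights: (V, B) skinning weights of each vertex.
    Returns (loss, targets); targets are 0.5 for the dominant bone, else 0.
    """
    num_vertices, num_bones = skinning_weights.shape
    dominant = np.argmax(skinning_weights, axis=1)
    targets = np.zeros((num_vertices, num_bones))
    targets[np.arange(num_vertices), dominant] = 0.5
    loss = float(np.mean((part_predictions - targets) ** 2))
    return loss, targets

weights = np.array([[0.7, 0.3],
                    [0.2, 0.8]])

# A decomposition that respects the labels, and one where part 0 explains everything.
decomposed = np.array([[0.5, 0.0], [0.0, 0.5]])
collapsed = np.array([[0.5, 0.0], [0.5, 0.0]])

loss_ok, targets = skinning_weight_loss(decomposed, weights)
loss_collapsed, _ = skinning_weight_loss(collapsed, weights)
print(targets, loss_ok, loss_collapsed)
```

The collapsed configuration, in which one part covers every vertex, is penalized, which is the failure mode this auxiliary loss is meant to rule out.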
Auxiliary loss – parsimony
As parts create a whole via a simple union, nothing prevents unnecessary overlaps between parts. To remove this nullspace from our training, we seek a minimal description by penalizing the volume of each part:
(8) 
This loss improves generalization, as quantified in the supplementary material.
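One way to read "penalizing the volume of each part" is as a Monte Carlo estimate: averaging a part indicator over uniformly sampled points is proportional to the part's volume. The sketch below follows that reading, which is an assumption about the form of the penalty rather than the exact expression used in training; `parsimony_penalty` is a hypothetical name.

```python
import numpy as np

def parsimony_penalty(part_values):
    """Sum over parts of the mean indicator value at sampled points.

    part_values: (B, N) value of each part indicator at N sampled points.
    The per-part mean is a proxy for that part's volume, up to the
    volume of the sampling region.
    """
    return float(np.sum(np.mean(part_values, axis=1)))

# Two ways of covering the same four sample points with two parts.
tight = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])    # no overlap
bloated = np.array([[1.0, 1.0, 1.0, 0.0],
                    [0.0, 1.0, 1.0, 1.0]])  # parts overlap on two points

# Both decompositions produce the same union ...
same_union = np.array_equal(tight.max(axis=0), bloated.max(axis=0))
# ... but the overlapping one pays a larger penalty.
print(same_union, parsimony_penalty(tight), parsimony_penalty(bloated))
```

Because the union is identical in both cases, the occupancy loss alone cannot distinguish them; the volume penalty is what breaks the tie in favor of the non-overlapping decomposition.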
Training
Given , and as found through hyperparameter tuning, the overall loss for our model is:
All models are trained with the Adam optimizer, with batch size and learning rate . For better gradient propagation, we use softmax whenever a max was employed in our expressions. For each optimization step, we use points sampled uniformly within the bounding box and points sampled near the ground truth surface. For all the 2D experiments, we train the model for K iterations which takes approximately hours on a single NVIDIA Tesla V100. For 3D experiments, the models are trained for K iterations for approximately hours.
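Two ingredients of this training setup can be sketched in isolation: a smooth replacement for the max over parts, and the mixture of uniform and near-surface samples. The softmax-weighted average below is one common reading of "softmax in place of max" and is an assumption, as are the `temperature`, the Gaussian jitter `sigma`, the sample counts, and the function names.

```python
import numpy as np

def soft_max_pool(values, temperature=10.0):
    """Softmax-weighted average over the last axis.

    Approaches the hard max as temperature grows, but passes gradient
    to every part rather than only the largest one.
    """
    scaled = temperature * values
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scaled)
    w = w / w.sum(axis=-1, keepdims=True)
    return (w * values).sum(axis=-1)

def sample_training_points(surface_points, bbox_min, bbox_max,
                           n_uniform, n_surface, sigma, rng):
    """Uniform samples in the bounding box plus jittered surface samples."""
    dim = len(bbox_min)
    uniform = rng.uniform(bbox_min, bbox_max, size=(n_uniform, dim))
    picks = surface_points[rng.integers(0, len(surface_points), size=n_surface)]
    near = picks + rng.normal(scale=sigma, size=picks.shape)
    return np.concatenate([uniform, near], axis=0)

part_values = np.array([0.1, 0.9, 0.3])
smooth = soft_max_pool(part_values)
sharp = soft_max_pool(part_values, temperature=1000.0)

rng = np.random.default_rng(0)
circle = np.stack([np.cos(np.linspace(0, 2 * np.pi, 32)),
                   np.sin(np.linspace(0, 2 * np.pi, 32))], axis=1)
bbox_min, bbox_max = np.array([-2.0, -2.0]), np.array([2.0, 2.0])
pts = sample_training_points(circle, bbox_min, bbox_max,
                             n_uniform=8, n_surface=8, sigma=0.05, rng=rng)
print(smooth, sharp, pts.shape)
```

At moderate temperature the pooled value sits slightly below the hard max, and it converges to the max as the temperature increases.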
Network architectures
To keep our experiments comparable across baselines, we use the same network architecture for all the models while varying the width of the layers. The network backbone is similar to DeepSDF [29], but simplified to 4 layers. Each layer has a residual connection, and uses the Leaky ReLU activation function. All layers have the same size, which we vary from 88 to 760 according to the experiment (i.e., a backbone with 88 hidden units in the first layer will be marked as “@88”). For the piecewise rigid (4) and deformable (6) models, note that the neurons are distributed across different channels; e.g. with R@960 we mean that each of the $B$ branches will be processed by dense layers having $960/B$ neurons. Similarly to the use of grouped filters/convolutions [19, 13], such a structure allows for significant performance boosts compared to unstructured models (2), as the different branches can be executed in parallel on separate compute devices.
5 Evaluation
We employ two datasets to evaluate our method in 2D and 3D. The datasets consist of a rest configuration surface, sampled indicator function values, bone transformation frames per pose, and skinning weights. The ground truth indicator functions were robustly computed via generalized winding numbers [15], and are evaluated on a regular grid surrounding the deformed surface with additional samples on the surface. The performance of the models can be evaluated by comparing the Intersection over Union (IoU) of the predicted indicator values against the ground truth samples on a regular grid.
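A minimal sketch of the IoU metric on sampled indicator values is given below. Binarizing at the ½ level and returning 1 when both sets are empty are assumptions made for illustration; `occupancy_iou` is a hypothetical name.

```python
import numpy as np

def occupancy_iou(pred, target, level=0.5):
    """Intersection over Union of two occupancy samplings on the same grid."""
    p = np.asarray(pred) >= level
    t = np.asarray(target) >= level
    union = np.logical_or(p, t).sum()
    if union == 0:
        return 1.0  # both empty: treated as perfect agreement (assumed)
    return float(np.logical_and(p, t).sum() / union)

# Predicted and ground-truth occupancy at four grid samples.
pred = np.array([0.9, 0.8, 0.2, 0.6])
target = np.array([1.0, 0.0, 0.0, 1.0])
iou = occupancy_iou(pred, target)
print(iou)
```

Here two samples are inside both shapes and three are inside at least one, giving an IoU of 2/3.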
5.1 Analysis on 2D data
Our gingerbread dataset consists of 100 different poses sampled from a temporally coherent animation. The animation drives the geometry in two different ways: (i) in the rigid dataset, we have a collection of surfaces, and each surface region is rigidly attached to a single bone which does not change shape as the pose changes; (ii) in the blended dataset, we employ the skinning weights to deform the surfaces via LBS. Our 2D results are summarized in Figure 3: given enough neural capacity, both the unstructured and deformable model are able to overfit to the training data. Note that since the animation produced via skinning exhibits highly non-rigid (i.e. blended) deformations, the rigid model struggles.
Unstructured model – “U”
Looking at overfitting results can be misleading, and, in this sense, the fundamental limitations of the unstructured model are revealed in Figure 6. The unstructured model gives reasonable reconstructions across poses seen within the training set, but struggles to generalize to new poses: the more different the pose is from those in the training set, the worse the IoU score.
Piecewise rigid model – “R”
Training the representation in (4) via SGD is effective when the data can truly be modeled by a piecewise rigid decomposition; see Figure 4 (top). When the same network is trained on a dataset which violates this assumption, the learning performance degrades significantly; see Figure 4 (bottom). The rigid animation is recreated exactly, but the blended animation has incorrect blurred boundaries, and is missing portions of the bone indicators. Note that adding more capacity to the network brought no further improvements in performance, as the dataset violates the core assumption of the rigid model.
Piecewise deformable model – “D”
Skinned deformation models give smooth transitions between bones, making a single continuous surface across the range of deformations. Conditioning the network with pose as in (6) allows the network to learn the relative deformation of the parts across poses. When the surface cannot be simply modelled with a piecewise rigid decomposition, the piecewise deformable model performs significantly better. This improvement can clearly be seen by comparing the results of Figure 5 to those of Figure 4. While in interpolation scenarios (replaying the frames of a known animation) the deformable model performs excellently, it struggles when dealing with extrapolation (Figure 6). The extrapolation performance on a realistic 3D dataset (Section 5.2) is better, perhaps because physically correct deformations are not as exaggerated as in our gingerbread example, hence more predictable.
5.2 Analysis on 3D data
The AMASS dataset [26] is a large-scale collection of 3D human motion driven by SMPL [23]. In this paper, we use the “DFaust_67” subset of AMASS which has 10 humanoid characters performing different motions. We select the “50002” subject (see Figure 1) and consider four different sequences. The deformation model of the dataset involves LBS with pose-space correctives [23]. As each sequence contains to frames, the overall training dataset contains 3D objects, which is roughly the same as the biggest classes (planes, chairs) of the ShapeNet dataset [3]. Note that the model is trained on a single character and is not expected to generalize across characters, but rather across animation sequences. Further, note that we did not balance the sampling density to focus the network training on small features such as fingers and face details, as these are not animated in AMASS.
Analysis
As shown in Table 1 and Figure 7, the deformable model is able to achieve high IoU scores with fewer model parameters than would be required with a fully unstructured network. On the AMASS dataset, the rigid model performed well on interpolation tasks, and did not suffer failures analogous to those found in 2D (see Figure 6); we believe this is because our gingerbread character presents non-rigid deformations far beyond the level present in the AMASS dataset. Note how both rigid and deformable models are significantly better than the unstructured baseline in generalizing to unseen poses; see qualitative results in Figure 1, as well as in the supplementary material. The plot in Figure 7 also reveals that the rigid model is able to extrapolate much better than the unstructured model. Nonetheless, the deformable model results are still not optimal, as the model was unable to properly extrapolate the pose-dependent correctives for unseen poses.
6 Conclusions
We introduced the problem of geometric modeling of deformable (solid) models from a neural perspective. We showed how unstructured baselines require a significantly larger neural budget compared to structured baselines, but more significantly, they simply fail to generalize. Amongst structured baselines, the deformable model performs best at interpolation, while the rigid model leads the extrapolation benchmarks. It would be interesting to understand how to combine these two models and inherit both behaviors. Note that the deformable model (“D”) is still usable in applications as long as the query poses are sufficiently similar to those seen at training time.
Application
Our approach can be applied to a number of problems. These include representation of complex articulated bodies such as human characters, object intersection queries for computer vision registration and tracking, collision detection for computer games and other applications, and compression of mesh sequences. In all these applications, neural shape approximation allows different trade-offs of efficiency vs. detail to be handled using the same general approach.
Future directions
One natural direction for future work would be to reduce the amount of supervision needed. To name a few goals in increasing order of complexity: (i) Can we learn the posing transformations, and perhaps also the rest transformations, automatically? (ii) Can the representation be generalized to capture collections of deformable bodies (i.e. the parameters of SMPL [23])? (iii) Can the signed distance function, rather than occupancy, be learnt as well? (iv) Is NASA a representation suitable for differentiable rendering? (v) Can a 3D representation of articulated motion be learnt from 2D supervision alone?
7 Acknowledgements
We would like to particularly thank Paul Lalonde for the initial project design, and Gerard Pons-Moll for his help accessing the AMASS data. We would also like to thank David I.W. Levin, Alec Jacobson, Hugues Hoppe, Nicholas Vining, Yaron Lipman, and Angjoo Kanazawa for the insightful discussions.
References
[1] Dragomir Anguelov, Daphne Koller, Hoi-Cheung Pang, Praveen Srinivasan, and Sebastian Thrun. Recovering articulated object models from 3D range data. In Uncertainty in Artificial Intelligence, 2004.
[2] Stephen W. Bailey, Dave Otte, Paul Dilorenzo, and James F. O’Brien. Fast and deep deformation approximations. SIGGRAPH, 2018.
[3] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv:1512.03012, 2015.
[4] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. CVPR, 2019.
[5] Zhiqin Chen, Kangxue Yin, Matthew Fisher, Siddhartha Chaudhuri, and Hao Zhang. BAE-NET: Branched autoencoder for shape co-segmentation. In ICCV, 2019.
[6] Angela Dai and Matthias Nießner. Scan2Mesh: From unstructured range scans to 3D meshes. In CVPR, 2019.
[7] Fernando de Goes, Siome Goldenstein, and Luiz Velho. A hierarchical segmentation of articulated bodies. In SGP, 2008.
[8] Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. CvxNets: Learnable convex decomposition. arXiv:1909.05736, 2019a.
[9] Boyang Deng, Simon Kornblith, and Geoffrey Hinton. Cerberus: A multi-headed derenderer. arXiv:1905.11940, 2019b.
[10] Lin Gao, Jie Yang, Tong Wu, Yu-Jie Yuan, Hongbo Fu, Yu-Kun Lai, and Hao Zhang. SDM-NET: Deep generative network for structured deformable mesh. ACM TOG, 2019.
[11] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. AtlasNet: A papier-mâché approach to learning 3D surface generation. arXiv:1802.05384, 2018.
[12] Qixing Huang, Vladlen Koltun, and Leonidas Guibas. Joint shape segmentation with linear programming. ACM TOG, 2011.
[13] Yani Ioannou, Duncan Robertson, Roberto Cipolla, and Antonio Criminisi. Deep Roots: Improving CNN efficiency with hierarchical filter groups. In CVPR, 2017.
[14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[15] Alec Jacobson, Ladislav Kavan, and Olga Sorkine-Hornung. Robust inside-outside segmentation using generalized winding numbers. ACM TOG, 2013.
[16] Alec Jacobson, Zhigang Deng, Ladislav Kavan, and J.P. Lewis. Skinning: Real-time shape deformation. In ACM SIGGRAPH Courses, 2014.
[17] Doug L. James and Christopher D. Twigg. Skinning mesh animations. SIGGRAPH, 2005.
[18] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] Binh Huy Le and Zhigang Deng. Smooth skinning decomposition with rigid bones. ACM TOG, 2012.
[21] J. P. Lewis, Matt Cordner, and Nickson Fong. Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, pages 165–172, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[22] Ming C. Lin, Dinesh Manocha, and Jon Cohen. Collision detection: Algorithms and applications, 1996.
[23] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. SIGGRAPH Asia, 2015.
[24] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3D surface construction algorithm. In SIGGRAPH, 1987.
[25] Dominik Lorenz, Leonard Bereska, Timo Milbich, and Björn Ommer. Unsupervised part-based disentangling of object shape and appearance. arXiv:1903.06946, 2019.
[26] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. AMASS: Archive of motion capture as surface shapes. ICCV, 2019.
[27] Stan Melax, Leonid Keselman, and Sterling Orsten. Dynamics based 3D skeletal hand tracking. In Graphics Interface, 2013.
[28] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. arXiv:1812.03828, 2018.
[29] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. CVPR, 2019.
[30] Edoardo Remelli, Anastasia Tkach, Andrea Tagliasacchi, and Mark Pauly. Low-dimensionality calibration through local anisotropic scaling for robust hand model personalization. In ICCV, 2017.
[31] Hanan Samet. Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley Longman Publishing Co., Inc., 1990. ISBN 020150300X.
[32] Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez, and Christian Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In CVPR, 2017.
[33] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. arXiv:1904.12356, 2019.
[34] Julien Valentin, Cem Keskin, Pavel Pidlypenskyi, Ameesh Makadia, Avneesh Sud, and Sofien Bouaziz. TensorFlow Graphics: Computer graphics meets deep learning. https://github.com/tensorflow/graphics, 2019.