Learning Spectral Transform Network on 3D Surface for Non-rigid Shape Analysis
Abstract
Designing a network on a 3D surface for non-rigid shape analysis is a challenging task. In this work, we propose a novel spectral transform network on 3D surfaces to learn shape descriptors. The proposed network architecture consists of four stages: raw descriptor extraction, surface second-order pooling, mixture-of-power-function-based spectral transform, and metric learning. The proposed network is simple and shallow. Quantitative experiments on challenging benchmarks show its effectiveness for non-rigid shape retrieval and classification, e.g., it achieved the highest accuracies on the SHREC’14 and SHREC’15 datasets as well as on the “range” subset of the SHREC’17 dataset.
Keywords:
non-rigid shape analysis · spectral transform · shape representation

1 Introduction
3D shape analysis has become increasingly important with the advances of shape scanning and processing techniques. Shape retrieval and classification are two fundamental tasks of 3D shape analysis, with diverse applications in archeology, virtual reality, medical diagnosis, etc. 3D shapes generally include rigid shapes, e.g., CAD models, and non-rigid shapes such as human surfaces with non-rigid deformations.
A fundamental problem in non-rigid shape analysis is shape representation. Traditional shape representation methods are mostly based on local hand-crafted descriptors such as shape context [4], meshSIFT [39, 22], spin images [19], etc., and they have shown effective performance especially for shape matching and recognition. These descriptors are further modeled as middle-level shape descriptors by the Bag-of-Words model [21], VLAD [18], etc., and then applied to shape classification and retrieval. For shapes with non-rigid deformations, the model in [11] generalizes shape descriptors from Euclidean metrics to non-Euclidean metrics. Spectral descriptors, which are built on the spectral decomposition of the Laplace–Beltrami operator defined on a 3D surface, are popular in non-rigid shape representation. Typical spectral descriptors include the diffusion distance [20], heat kernel signature (HKS) [41], wave kernel signature (WKS) [2] and scale-invariant heat kernel signature (SIHKS) [6]. In [8], the spectral descriptors SIHKS and WKS combined with a Large Margin Nearest Neighbor (LMNN) embedding achieved state-of-the-art results for non-rigid shape retrieval. Spectral descriptors are commonly intrinsic and invariant to isometric deformations, and are therefore effective for non-rigid shape analysis.
Recently, a promising trend in non-rigid shape representation is learning-based methods on 3D surfaces for non-rigid shape retrieval and classification. Many learning-based methods take low-level shape descriptors as inputs and extract high-level descriptors by integrating over the entire shape. In the work of [8], they first extract SIHKS and WKS, and then integrate them to form a global descriptor followed by an LMNN embedding. Global shape descriptors are learned by a Long Short-Term Memory (LSTM) network in [45] based on spectral descriptors. The eigen-shape and Fisher-shape descriptors are learned by a modified autoencoder based on spectral descriptors in [12]. These works have shown impressive results in learning global shape descriptors. Despite these advances, designing learning-based methods on 3D surfaces remains an emerging and challenging task, including how to design feature aggregation and feature learning on a 3D surface for non-rigid shape representation.
In this work, we propose a novel learning-based spectral transform network on 3D surfaces to learn discriminative shape descriptors for non-rigid shape retrieval and classification. First, we define a second-order pooling operation on 3D surfaces which models the second-order statistics of input raw descriptors. Second, considering that the pooled second-order descriptors lie on a manifold of symmetric positive definite matrices (SPDM-manifold), we define a novel manifold transform for feature learning by learning a mixture of power function on the singular values of the SPDM descriptors. Third, by concatenating the stages of raw descriptor extraction, surface second-order pooling, transform on the SPDM-manifold and metric learning, we propose a novel network architecture, dubbed the spectral transform network, as shown in Fig. 1, which can learn discriminative shape descriptors for non-rigid shape analysis.
To the best of our knowledge, this is the first paper that learns second-order-pooling-based shape descriptors on 3D surfaces using a network architecture. Our network structure is simple and easy to train, and is shown to significantly improve the discriminative ability of input raw descriptors. It adapts to various non-rigid shapes such as watertight meshes, partial meshes and point cloud data. It achieved competitive results on challenging benchmarks for non-rigid shape retrieval and classification, e.g., the highest accuracy on the SHREC’14 [32] dataset and the state-of-the-art accuracy on the “range” subset of SHREC’17 [28] in the NN metric [38].
2 Related Works
2.1 Learning approaches for 3D shapes.
Deep learning is a powerful tool in computer vision, speech recognition, natural language processing, etc. Recently, it has also been extended to 3D shape analysis and has achieved impressive progress. One way to perform the extension is to represent shapes as volume data [44, 35] or multi-view data [3] and then feed them to deep neural networks. Voxel- and multi-view-based shape representations have been successful for rigid shapes [35, 40], relying on large training datasets. Due to the operations of voxelization and 3D-to-2D projection, they may lose shape details, especially for non-rigid shapes with large deformations, e.g., human bodies in different poses. An alternative way is to define the networks directly on the 3D surface based on spectral descriptors, as in [45, 12, 5, 27]. These models benefit from the intrinsic properties of spectral descriptors, and utilize surface convolutions or deep neural networks, e.g., LSTM or autoencoder, to further learn discriminative shape descriptors. PointNet [34, 36] is another interesting deep learning approach that directly builds a network on the point cloud representation of 3D shapes, and it can also handle non-rigid 3D shape classification using a non-Euclidean metric. Compared with these methods, we build a novel network architecture on the 3D surface from the perspectives of second-order descriptor pooling and a spectral transform on the pooled descriptors. It is shown to effectively learn surface descriptors on the SPDM-manifold.
2.2 Second-order pooling of shape descriptors.
The second-order pooling operation was first proposed in [7], showing outstanding performance in 2D vision tasks such as recognition [16] and segmentation [7]. The pooled descriptors lie on a Riemannian SPDM-manifold. Due to the non-Euclidean structure of this manifold, many traditional machine learning methods based on Euclidean metrics cannot be used directly. As discussed in [1, 31], two popular metrics on the SPDM-manifold are the affine-invariant metric and the log-Euclidean metric. Considering the computational complexity, the log-Euclidean metric and its variants [17, 15], which embed data into a Euclidean space, are more widely used [7, 43]. The power-Euclidean transform [10] has achieved impressive results and theoretically approximates the log-Euclidean metric as its power index approaches zero. The shape descriptors most related to ours for 3D shape analysis are the covariance-based descriptors [9, 43]. In [9], point descriptors such as angular and texture features within a 3D point neighbourhood are encoded by a covariance matrix. In [43], the covariance descriptors were further incorporated into the Bag-of-Words model to represent shapes for retrieval and correspondence. In our work, we present a formal definition of second-order pooling of shape descriptors on 3D surfaces, and define a learning-based spectral transform on the SPDM-manifold, which can effectively boost the performance of the pooled descriptors for 3D non-rigid shape analysis.
In the following sections, we first introduce our proposed spectral transform network in Sect. 3. Then, in Sect. 4, we experimentally justify the effectiveness of the proposed network on benchmark datasets for nonrigid shape retrieval and classification. We finally conclude this work in Sect. 5.
3 Spectral Transform Network on 3D Shapes
We aim to learn discriminative shape descriptors for 3D shape analysis by designing a spectral transform network (STNet) on 3D surfaces. As illustrated in Fig. 1, our approach consists of four stages: raw descriptor extraction, surface second-order pooling, SPDM-manifold transform and metric learning. In the following, we give detailed descriptions of these stages.
3.1 Raw descriptor extraction
Let S denote the surface (either mesh or point cloud) of a given shape; in this stage, we extract descriptors from S. For watertight surfaces, we select spectral descriptors, i.e., SIHKS [6] and WKS [2], as inputs, which are intrinsic and robust to non-rigid deformations. For partial surfaces and point clouds, we choose local geometric descriptors such as Localized Statistical Features (LSF) [29]. All of them are dense descriptors representing multi-scale geometric features of the shape. Note that our framework is generic, and other shape descriptors, such as normals and curvatures, can also be used.
Spectral descriptors. Spectral descriptors mostly depend on the spectrum (eigenvalues and/or eigenfunctions) of the Laplace–Beltrami operator, and they are well suited for the analysis of non-rigid shapes. Popular spectral descriptors include HKS [41], SIHKS [6] and WKS [2]. Derived from the heat diffusion process, HKS [41] reflects the amount of heat remaining at a point after a certain time. SIHKS [6] is derived from HKS and is scale-invariant. Both of them are intrinsic but lack spatial localization capability. WKS [2] is another intrinsic spectral descriptor, stemming from the Schrödinger equation. It evaluates the probability of a quantum particle on a shape being located at a point under a certain energy distribution, and it is better at spatial localization.
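As a concrete illustration, HKS can be computed directly from a truncated eigendecomposition of the discretized Laplace–Beltrami operator. The sketch below assumes the eigenpairs are already available (e.g., from a cotangent-Laplacian eigensolver); the function name and the toy inputs are illustrative, not taken from the paper.

```python
import numpy as np

def heat_kernel_signature(evals, evecs, t_values):
    """HKS at each vertex: h(x, t) = sum_k exp(-lambda_k * t) * phi_k(x)^2.

    evals:    (K,) Laplace-Beltrami eigenvalues (ascending, evals[0] ~ 0).
    evecs:    (n, K) eigenfunctions sampled at the n vertices.
    t_values: (T,) diffusion times.
    Returns an (n, T) matrix of per-vertex descriptors.
    """
    decay = np.exp(-np.outer(evals, t_values))  # exp(-lambda_k * t), shape (K, T)
    return (evecs ** 2) @ decay                 # (n, T)

# Toy example: 4 vertices, 3 eigenpairs from some discretized operator.
evals = np.array([0.0, 0.5, 1.2])
evecs = np.random.default_rng(0).normal(size=(4, 3))
hks = heat_kernel_signature(evals, evecs, np.array([0.1, 1.0, 10.0]))
```

Sampling several diffusion times t gives the usual multi-scale HKS vector per point; SIHKS then post-processes this curve (log-derivative and Fourier magnitude) to remove scale dependence.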
Local geometric descriptors. Another kind of raw shape descriptor is the local geometric descriptor, which encodes the local geometric and spatial information of the shape. We select LSF [29] as input for partial and point-cloud non-rigid shape analysis; it encodes the relative positions and angles locally on the shape. Assume the selected point is p, with position x_p and normal vector n_p, and another point q, with position x_q and normal vector n_q, lies within its sphere of influence of radius r. Then a 4-tuple (α, β, γ, δ) is computed as:
\[
u = n_p, \quad v = \frac{(x_q - x_p) \times u}{\|(x_q - x_p) \times u\|}, \quad w = u \times v,
\]
\[
\alpha = v \cdot n_q, \quad \beta = u \cdot \frac{x_q - x_p}{\|x_q - x_p\|}, \quad \gamma = \arctan\!\big(w \cdot n_q,\; u \cdot n_q\big), \quad \delta = \|x_q - x_p\| \tag{1}
\]
For a local shape of n points, a set of 4-tuples is computed for the center point and collected into a 4-dimensional joint histogram. By dividing the histogram into 5 bins for each dimension of the tuple, we obtain a 625-d descriptor for the center point, which encodes the local geometric information around it.
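The 4-tuple and its histogram accumulation can be sketched as follows. This is a hedged reading of Eq. (1) based on the standard surflet-pair relation that LSF builds on; the helper names, bin ranges and histogram normalization are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def surflet_tuple(x_p, n_p, x_q, n_q):
    """Surflet-pair relation between two oriented points (assumed form of Eq. (1))."""
    d = x_q - x_p
    u = n_p
    v = np.cross(d, u)
    v /= (np.linalg.norm(v) + 1e-12)             # guard against d parallel to u
    w = np.cross(u, v)
    return (np.dot(v, n_q),                      # alpha: angle feature
            np.dot(u, d) / (np.linalg.norm(d) + 1e-12),  # beta
            np.arctan2(np.dot(w, n_q), np.dot(u, n_q)),  # gamma
            np.linalg.norm(d))                   # delta: point distance

def lsf_descriptor(tuples, ranges, bins=5):
    """Accumulate 4-tuples into a 5x5x5x5 joint histogram -> 625-d vector."""
    hist = np.zeros((bins,) * 4)
    for t in tuples:
        idx = tuple(min(int((v - lo) / (hi - lo) * bins), bins - 1)
                    for v, (lo, hi) in zip(t, ranges))
        hist[idx] += 1
    return hist.ravel() / max(len(tuples), 1)    # normalized 625-d descriptor
```

In use, one would gather all neighbors q of the center point p within radius r, compute a tuple per pair, and bin them over the tuples' natural value ranges.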
Given the surface S, we extract either spectral or geometric shape descriptors, called raw descriptors, for each point x on S, denoted as f(x); these are taken as the inputs of the following stage.
3.2 Surface second-order pooling
In this stage, we generalize the second-order average-pooling operation [7] from 2D images to 3D surfaces, and propose a surface second-order pooling operation. Given the extracted shape descriptors f(x) for x on S, the surface second-order pooling is defined as:
\[
G(S) = \frac{1}{|S|} \int_{S} f(x)\, f(x)^{\top} \, dx \tag{2}
\]
where |S| denotes the area of the surface, f(x) f(x)^⊤ is the second-order descriptor for a point x on S, and G(S) is the matrix of the pooled second-order descriptor on S, which is taken as the output of this stage.
For a surface represented by a discretized irregular triangular mesh, the integral operation in Eq. (2) can be discretized considering the Voronoi area around each point:
\[
G(S) = \frac{1}{\sum_{i} a_i} \sum_{i} a_i\, f(x_i)\, f(x_i)^{\top} \tag{3}
\]
where x_i denotes a discretized point on S with Voronoi area a_i. In our work, we compute a_i as in [33]. For shapes composed of point clouds, Eq. (2) can be discretized as average pooling of the second-order information:
\[
G(S) = \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, f(x_i)^{\top} \tag{4}
\]
where n denotes the number of points on the surface.
The pooled second-order descriptors represent second-order statistics of the raw descriptors over the 3D surface. It is obvious that G(S) is a symmetric positive definite matrix (SPDM), which lies on the non-Euclidean manifold of SPDMs.
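The discretized pooling in Eqs. (3) and (4) amounts to a (weighted) average of outer products and can be sketched in a few lines. Here the raw descriptors are assumed to be stacked row-wise; the function name is ours.

```python
import numpy as np

def second_order_pool(F, weights=None):
    """Pooled second-order descriptor: G = sum_i a_i f_i f_i^T / sum_i a_i.

    F:       (n, d) raw descriptors, one row per surface point.
    weights: optional (n,) per-point Voronoi areas (mesh case, Eq. (3));
             if None, plain averaging over points is used (point clouds, Eq. (4)).
    """
    if weights is None:
        weights = np.ones(len(F))
    weights = weights / weights.sum()
    G = (F * weights[:, None]).T @ F     # (d, d) weighted sum of outer products
    return 0.5 * (G + G.T)               # enforce exact numerical symmetry

F = np.random.default_rng(1).normal(size=(100, 8))
G = second_order_pool(F)
```

Note that the pooled matrix is positive semi-definite by construction; it is strictly positive definite whenever the descriptors span the full d dimensions, which holds in practice for n much larger than d.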
3.3 SPDM-manifold transform
This stage, i.e., the SPDM-T stage, performs a nonlinear transform on the singular values of the pooled second-order descriptors; it is in fact a spectral transform on the SPDM-manifold. This transform is discriminatively learned for the specific task, enforced by the loss in the subsequent metric learning stage.
Forward computation. Assume that we have a symmetric positive definite matrix X. By singular value decomposition, we have:
\[
X = U \Sigma U^{\top} \tag{5}
\]
We first normalize the singular values of X, i.e., the diagonal values of Σ, achieving normalized values σ̂_1, …, σ̂_d, where d is the number of singular values; we then perform a nonlinear transform on them. Inspired by polynomial functions, we propose the following transform:
\[
\Sigma' = \mathrm{diag}\big(g(\hat{\sigma}_1), \ldots, g(\hat{\sigma}_d)\big) \tag{6}
\]
where diag(·) is a diagonal matrix with the input elements as its diagonal values, and g(·) is a mixture of power function:
\[
g(\sigma) = \sum_{k=1}^{K} w_k\, \sigma^{\alpha_k} \tag{7}
\]
where α_1, …, α_K are samples with uniform intervals within a fixed range, and w = (w_1, …, w_K) is a vector of combination coefficients required to satisfy:
\[
w_k \ge 0, \qquad \sum_{k=1}^{K} w_k = 1 \tag{8}
\]
To meet these requirements, the coefficients are defined as:
\[
w_k = \frac{\exp(\theta_k)}{\sum_{j=1}^{K} \exp(\theta_j)}, \qquad k = 1, \ldots, K \tag{9}
\]
Then we instead learn the parameters θ = (θ_1, …, θ_K) to determine w.
After this transform, a new singular value matrix Σ′ is derived. Combining it with the original singular vector matrix U, we get the transformed descriptor as:
\[
X' = U \Sigma' U^{\top} \tag{10}
\]
X′ is also a symmetric positive definite matrix. Due to the symmetry of X′, the elements of its upper triangle, vecu(X′), are kept as the output of this stage, where vecu(·) is an operator vectorizing the upper triangular elements of a matrix.
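The forward pass of this stage can be sketched as below. We use a symmetric eigendecomposition, which for an SPD matrix coincides with its SVD; the softmax parameterization follows Eq. (9), while the ℓ2 normalization of the spectrum and the exponent grid are our assumptions for illustration.

```python
import numpy as np

def spdm_transform(G, theta, alphas):
    """SPDM-T forward pass (sketch): decompose, normalize the spectrum,
    apply the mixture-of-power function, recompose, and vectorize.

    G:      (d, d) pooled SPD descriptor.
    theta:  (K,) learnable logits determining the mixture weights.
    alphas: (K,) fixed power exponents.
    Returns the upper-triangular part of the transformed matrix.
    """
    w = np.exp(theta)
    w /= w.sum()                                  # softmax -> convex weights (Eq. (9))
    sigma, U = np.linalg.eigh(G)                  # eigendecomposition = SVD for SPD G
    sigma = np.clip(sigma, 0, None)               # guard against tiny negatives
    sigma_hat = sigma / (np.linalg.norm(sigma) + 1e-12)   # normalize singular values
    g = sum(wk * sigma_hat ** ak for wk, ak in zip(w, alphas))  # mixture of powers
    G_t = (U * g) @ U.T                           # U diag(g) U^T
    iu = np.triu_indices(G.shape[0])
    return G_t[iu]                                # vectorized upper triangle

G = np.cov(np.random.default_rng(2).normal(size=(8, 50)))
y = spdm_transform(G, theta=np.zeros(4), alphas=np.linspace(0.25, 1.0, 4))
```

With all exponents equal to 1, the mixture reduces to the identity on the normalized spectrum, which makes the power-Euclidean transform (a single exponent) a clear special case.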
Backward propagation. As proposed in [17], matrix backpropagation can be performed through the SVD decomposition. Let Ddiag(·) denote an operator on a matrix that sets all non-diagonal elements to 0, vec(·) be an operator vectorizing the diagonal elements of a matrix, vec⁻¹(·) be the inverse operator of vec(·), and ∘ be the Hadamard (element-wise) product operator. Since the SVD of the pooled descriptor is computed offline, only the parameters θ require gradients in this stage. For backward propagation, assuming the partial derivative of the loss ℓ with respect to X′ is ∂ℓ/∂X′, we have:
\[
\frac{\partial \ell}{\partial \Sigma'} = \mathrm{Ddiag}\!\left( U^{\top} \frac{\partial \ell}{\partial X'}\, U \right) \tag{11}
\]
\[
\frac{\partial \ell}{\partial w_k} = \mathrm{vec}\!\left(\frac{\partial \ell}{\partial \Sigma'}\right)^{\!\top} \hat{\sigma}^{\,\circ \alpha_k}, \qquad k = 1, \ldots, K \tag{12}
\]
\[
\frac{\partial \ell}{\partial \theta} = \big(\mathrm{diag}(w) - w\, w^{\top}\big)\, \frac{\partial \ell}{\partial w} \tag{13}
\]
where σ̂^∘α_k denotes the element-wise power of the vector of normalized singular values. The partial derivative of the loss function with respect to the parameters θ is derived by successively computing Eqs. (11), (12), (13). Please refer to the supplementary material for the gradient computations.
Analysis of the SPDM-T stage. The pooled second-order descriptor lies on the SPDM-manifold, and a popular transform on this manifold is the log-Euclidean transform [10], i.e., log(X). However, it is unstable when the singular values of X are near or equal to zero. Logarithm-based transforms such as [17] and [15] were proposed to overcome this instability, but they need a positive constant regularizer that is difficult to set. The power-Euclidean transform [10] theoretically approximates the log-Euclidean metric as its power index approaches zero, while being more stable. Our proposed mixture of power function is an extension of the power-Euclidean transform that includes it as a special case. The SPDM-T stage learns an effective transform in the space spanned by the power functions adaptively, in a data-driven way. Furthermore, the mixture of power function is constrained to be nonlinear and retains the non-negativeness and order of the eigenvalues (i.e., the singular values of a symmetric matrix).
From a statistical perspective, G(S) can be seen as a covariance matrix of the input descriptors on the 3D surface. Geometrically, the eigenvectors in the columns of U construct a coordinate system, and the eigenvalues reflect the feature variances projected onto the eigenvectors. By transforming these projected variances (eigenvalues), g(·) implicitly tunes the statistics of the distribution of input raw descriptors in the pooling region during training. Since the entropy of a Gaussian distribution with covariance C is ½ log((2πe)^d det C), and det C is the product of the eigenvalues, transforming the eigenvalues by g(·) implicitly tunes the entropy of the distribution of raw descriptors on the 3D surface.
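The entropy argument can be checked numerically: a power exponent below one flattens a decaying spectrum and thereby raises the eigenvalue-dependent entropy term ½ Σ log λ_i. The toy spectrum and exponents below are illustrative only.

```python
import numpy as np

def gaussian_entropy_term(eigvals):
    """Eigenvalue-dependent part of a Gaussian's differential entropy:
    0.5 * log det(C) = 0.5 * sum(log lambda_i)."""
    return 0.5 * np.sum(np.log(eigvals))

# A sharply decaying toy spectrum, normalized to unit sum.
sigma = np.array([0.70, 0.20, 0.07, 0.03])

entropies = {}
for p in (1.0, 0.5, 0.25):
    s = sigma ** p
    s /= s.sum()                      # keep the overall scale comparable
    entropies[p] = gaussian_entropy_term(s)
# Smaller exponents flatten the spectrum, so the entropy term grows.
```

This matches the behavior reported for the learned transform later in the paper: the mixture of power function lifts the small eigenvalues relatively more, increasing the entropy of the pooled statistics.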
3.4 Metric learning
With the transformed descriptors vecu(X′) as input, we embed them into a low-dimensional space in which the descriptors are well grouped or separated under the guidance of labels. To prove the effectiveness of the SPDM-T stage, we design a shallow neural network for the metric learning stage. We first normalize the input, obtaining ŷ, and then add a fully connected layer:
\[
z = W \hat{y} \tag{14}
\]
where W is a learnable projection matrix and z is the descriptor of the whole shape. We further feed z into a loss function for the specific shape analysis task to enforce the discriminative ability of the shape descriptor. In this work, we focus on shape retrieval and classification. We next discuss the loss functions.
Shape retrieval. Given a training set of shapes, the loss for shape retrieval is defined on the set T of all possible triplets of shape descriptors (z_i, z_i^+, z_i^-), where i is the index of the triplet, and z_i^+ and z_i^- are two shape descriptors with the same and different labels w.r.t. the target shape descriptor z_i, respectively:
\[
\mathcal{L} = \frac{1}{|T|} \sum_{i=1}^{|T|} \Big( \|z_i - z_i^{+}\|_2^2 + \lambda \max\big(0,\; m + \|z_i - z_i^{+}\|_2^2 - \|z_i - z_i^{-}\|_2^2\big) \Big) \tag{15}
\]
where |T| is the number of triplets in T, ‖·‖_2 is the ℓ2 norm, m is the margin, and λ is a constant balancing the two terms.
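A minimal sketch of such a two-term triplet-style retrieval loss is given below. The exact weighting and margin of Eq. (15) are not reproduced here; the pull-plus-hinge form, the function name and the default hyper-parameters are our assumptions.

```python
import numpy as np

def retrieval_loss(z, z_pos, z_neg, margin=1.0, lam=1.0):
    """Triplet-style retrieval loss (sketch): a pull term on positive pairs
    plus a hinged push term, averaged over a batch of triplets.

    z, z_pos, z_neg: (N, d) anchor / same-label / different-label descriptors.
    """
    d_pos = np.sum((z - z_pos) ** 2, axis=1)     # squared l2 to positives
    d_neg = np.sum((z - z_neg) ** 2, axis=1)     # squared l2 to negatives
    pull = d_pos.mean()
    push = np.maximum(0.0, margin + d_pos - d_neg).mean()
    return pull + lam * push

# Well-separated triplets incur no hinge penalty.
z = np.zeros((2, 4))
z_pos = np.zeros((2, 4))
z_neg = np.full((2, 4), 10.0)
easy = retrieval_loss(z, z_pos, z_neg)
```

The hinge only activates when a negative comes within the margin of a positive, so a trained embedding drives the loss of most triplets to zero.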
Shape classification. We construct the cross-entropy loss for shape classification. Given the learned descriptors z_i with corresponding labels y_i, we first add a fully connected layer after z to map the features to scores for the different categories, followed by a softmax layer predicting the probability p_{i,c} of shape i belonging to category c. The loss function is defined as:
\[
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \mathbb{1}[y_i = c]\, \log p_{i,c} \tag{16}
\]
where N is the number of training shapes, i and c index the shape and category respectively, and C is the total number of categories.
The combination of the fully connected layer and the loss function results in a metric learning problem. Minimizing the loss embeds the shape descriptors into a lower-dimensional space, in which the learned descriptors are enforced to be discriminative for the specific shape analysis task.
3.5 Network training
For the task of shape retrieval, each triplet of shapes is taken as a training sample, and multiple triplets form a batch for training with a mini-batch stochastic gradient descent (SGD) optimizer. For shape classification, the network is also trained by mini-batch SGD. To train the network, the raw descriptor extraction and second-order pooling stages, as well as the SVD decomposition, can be computed offline; the learnable parameters of STNet are θ in the SPDM-T stage, W in the metric learning stage, and the later fully connected layer (for classification). The hyper-parameters of the nonlinear transform in the SPDM-T stage, i.e., the number of power functions and their sampling range, are fixed. The gradients of the loss are back-propagated to the SPDM-T stage.
4 Experiments
In this section, we evaluate the effectiveness of our STNet, especially the surface second-order pooling and SPDM-T stages, for 3D non-rigid shape retrieval and classification. We test our model on watertight and partial mesh datasets as well as a point cloud dataset. We successively introduce the datasets, evaluation methodologies, quantitative results, and the evaluation of our SPDM-T stage.
4.1 Datasets and evaluation methodologies
Considering that our network is designed for non-rigid shape analysis, we evaluate it for shape retrieval on the SHREC’14 [32] and SHREC’17 [28] datasets, and we test our architecture for shape classification on the SHREC’15 [23] dataset. All of them are composed of non-rigid shapes with various deformations.
SHREC’14. This dataset includes two subsets of Real and Synthetic human data respectively, both composed of watertight meshes. The Real subset comprises meshes of human subjects in different poses. The Synthetic subset consists of 15 human subjects with 20 poses each, resulting in 300 shapes. We try the three following experimental settings, referred to as setting-1, setting-2 and setting-3. In setting-1, the shapes are split into training and test sets as in [8]. In setting-2, an independent training set
SHREC’15. This dataset includes 1200 watertight shapes of 50 categories, each of which contains 24 shapes with various poses and topological structures. To compare with the state-of-the-art PointNet++ [36] on this dataset, we use the experimental setting of [36], i.e., treating the shapes as point cloud data and using 5-fold cross-validation to test the accuracy of shape classification.
SHREC’17. This dataset is composed of two subsets, i.e., “holes” and “range”, which contain meshes with holes and range data respectively. We use the provided standard splits for training / test. The “holes” subset consists of 1216 training and 1078 test shapes, and the “range” subset consists of 1082 training and 882 test shapes.
Evaluation methodologies. For non-rigid shape retrieval, we evaluate results by NN (Nearest Neighbor), 1T (First Tier), 2T (Second Tier), and DCG (Discounted Cumulative Gain) [38]. For non-rigid shape classification, the results are evaluated by classification accuracy, i.e., the percentage of correctly classified shapes.
4.2 Results for non-rigid shape retrieval on watertight datasets
Table 1: Retrieval results (%) on the SHREC’14 Synthetic and Real datasets (setting-1).

| Method | Synthetic NN | Synthetic 1T | Synthetic 2T | Synthetic DCG | Real NN | Real 1T | Real 2T | Real DCG |
|---|---|---|---|---|---|---|---|---|
| CSD-LMNN [8] | 99.7 | 98.0 | 99.9 | 99.6 | 97.9 | 92.8 | 98.7 | 97.6 |
| Surf-O1 | 82.7 | 77.1 | 84.3 | 83.6 | 54.2 | 52.6 | 57.9 | 55.0 |
| Surf-O2 | 87.3 | 84.2 | 89.2 | 87.8 | 61.1 | 57.2 | 64.0 | 63.2 |
| Surf-O1-ML | 100 | 96.9 | 99.9 | 99.7 | 96.7 | 91.9 | 98.3 | 96.9 |
| Surf-O2-ML | 100 | 100 | 100 | 100 | 98.8 | 96.1 | 99.6 | 99.9 |
| STNet | 100 | 100 | 100 | 100 | 100 | 99.8 | 100 | 99.9 |
For non-rigid shape retrieval on the SHREC’14 datasets, we select SIHKS and WKS as input raw descriptors. We discretize the Laplace–Beltrami operator as in [33], and compute the SIHKS and WKS descriptors. In the surface second-order pooling stage, the descriptors are computed by Eq. (3). When training the STNet for shape retrieval, the batch size, learning rate and margin are set as fixed hyper-parameters, as are the dimension of the shape descriptor z and the balancing constant λ in the loss function.
To justify the effectiveness of our architecture, we compare the following variants of descriptors for shape retrieval: (1) Surf-O1: pooled first-order (raw) descriptors on surfaces; (2) Surf-O2: pooled second-order descriptors on surfaces; (3) Surf-O1-ML: Surf-O1 descriptors followed by the metric learning stage; (4) Surf-O2-ML: Surf-O2 descriptors followed by the metric learning stage. For the retrieval task, the descriptors of Surf-O1 and Surf-O2 are directly used for retrieval based on Euclidean distance. In Table 1, we report the results of these descriptors in experimental setting-1, together with the state-of-the-art CSD-LMNN [8] method. As shown in the table, the increased accuracies from Surf-O1 to Surf-O2, and from Surf-O1-ML to Surf-O2-ML, indicate the effectiveness of the surface second-order pooling stage. The improvement from Surf-O2-ML to STNet demonstrates the advantage of the SPDM-T stage. Our full STNet achieves 100% accuracy in NN (i.e., the percentage of retrieved nearest-neighbor shapes belonging to the same class as the queries) on the SHREC’14 Synthetic and Real datasets. Compared with the state-of-the-art CSD-LMNN [8] method, the competitive accuracies justify the effectiveness of our method.
Table 2 presents results in mean average precision (mAP) on the SHREC’14 Real and Synthetic datasets in setting-1, compared with RMVM [14] and CSD-LMNN [8], of which CSD-LMNN is a state-of-the-art approach for this task. For our proposed STNet, we randomly split the training and test subsets five times and report the average mAP with standard deviations shown in brackets. In the table, we also present the baseline results of our STNet to justify the effectiveness of our architecture. Our STNet achieves the highest mAP on both datasets, demonstrating its effectiveness for watertight non-rigid shape analysis. In Table 3, we also show the results on the SHREC’14 Real dataset using setting-2 and setting-3, which are more challenging since the training and test sets have disjoint shape categories. STNet still significantly outperforms the baselines of Surf-O1-ML and Surf-O2-ML and achieves high accuracies. Our STNet also significantly outperforms Litman et al. [32] in experimental setting-3.
Table 2: Retrieval results in mAP (%) on the SHREC’14 datasets (setting-1).

| Dataset | RMVM [14] | CSD-LMNN [8] | Surf-O1 | Surf-O2 | Surf-O1-ML | Surf-O2-ML | STNet |
|---|---|---|---|---|---|---|---|
| Synthetic | 96.3 | 99.7 | 82.7 | 85.3 | 93.6 | 97.1 | 100 (0) |
| Real | 79.5 | 97.9 | 50.8 | 59.4 | 90.5 | 95.3 | 99.9 (0.1) |
Table 3: Retrieval results (%) on the SHREC’14 Real dataset in setting-2 and setting-3.

| Setting | Method | NN | 1T | 2T | EM | DCG |
|---|---|---|---|---|---|---|
| Setting-2 | Surf-O1-ML | 45.75 | 35.25 | 59.08 | 34.76 | 63.84 |
| | Surf-O2-ML | 80.25 | 63.94 | 78.39 | 40.94 | 80.21 |
| | STNet | 85.75 | 71.33 | 88.92 | 43.20 | 88.29 |
| Setting-3 | Litman [32] | 79.3 | 72.7 | 91.4 | 43.2 | 89.1 |
| | Surf-O1-ML | 54.29 | 46.67 | 70.91 | 37.60 | 71.71 |
| | Surf-O2-ML | 88.76 | 82.01 | 96.47 | 42.82 | 90.75 |
| | STNet | 92.53 (1.49) | 84.78 (2.43) | 96.93 (1.17) | 43.86 (0.36) | 93.85 (1.63) |
4.3 Results for non-rigid shape retrieval on partial datasets
We now evaluate our approach on non-rigid partial shapes from the “holes” and “range” subsets of the SHREC’17 dataset. Considering that the shapes are not watertight surfaces, we use local geometric descriptors as raw descriptors. We uniformly select 3000 points as in [30] for every shape and compute the 625-d LSF as inputs. In the surface second-order pooling stage, the descriptors are computed by Eq. (4). When training the STNet for shape retrieval, the batch size, learning rate and margin are set as fixed hyper-parameters, as are the dimension of the shape descriptor and the balancing constant λ in the loss function.
In Table 4, we first compare our STNet with the baselines, i.e., Surf-O1, Surf-O2, Surf-O1-ML and Surf-O2-ML, to show the effectiveness of our network architecture. The raw LSF descriptors after second-order pooling without learning, i.e., Surf-O2, produce accuracies of 69.0% and 71.7% in NN on the “holes” and “range” subsets respectively. With the metric learning stage, the results increase to 75.9% and 77.6% for these two subsets. The STNet, with both the SPDM-T stage and the metric learning stage, increases the accuracies to 96.1% and 97.3% respectively, around 20 percentage points above the results of Surf-O2-ML. This clearly shows the effectiveness of our SPDM-T stage for enhancing the discriminative ability of shape descriptors. Moreover, the performance increases from Surf-O1 to Surf-O2 and from Surf-O1-ML to Surf-O2-ML justify the effectiveness of the surface second-order pooling stage over traditional first-order pooling (i.e., average pooling) on surfaces.
In Table 4, we also compare our results with the methods that participated in the SHREC’17 track, i.e., DLSF [13], 2VDI-CNN, SBoF [25], BoW+RoPS, BoW+HKS, RMVM [14], DNA [37], etc.; their results are taken from [28]. In DLSF [13], the deep network is designed by first training an E-block and then training an AC-block. The method of 2VDI-CNN is based on multi-view projections of 3D shapes and a deep GoogLeNet [42] for shape descriptor learning. The method of SBoF [25] trains a bag-of-features model using sparse coding. The methods of BoW+RoPS and BoW+HKS combine the BoW model with the RoPS and HKS shape descriptors for shape retrieval. As shown in the table, our STNet ranks second on the “holes” subset and achieves the highest accuracy in NN, 2T and DCG on the “range” subset, demonstrating its effectiveness for partial non-rigid shape analysis. Compared with the deep learning-based methods DLSF [13] and 2VDI-CNN, our network architecture is shallow and simple to implement, yet achieves competitive performance.
In Figure 2, we show the top retrieved shapes (at intervals of 5 in ranking index) for the query shapes in the leftmost column; the examples in subfigures (a) and (b) are from the SHREC’17 “holes” and “range” subsets respectively. These shapes exhibit large non-rigid deformations. The examples show that STNet can effectively retrieve the correct shapes even when the shapes are range data or contain large holes.


4.4 Results for non-rigid shape classification on point cloud data
In this section, we mainly aim to compare our approach with the state-of-the-art deep network PointNet++ [36] for non-rigid shape classification on a point cloud representation. We compare on the SHREC’15 non-rigid shape dataset for classification, on which PointNet++ reported state-of-the-art results. For every shape, we uniformly sample 3000 points as in [30], and take the 625-d LSF as raw descriptors. The second-order descriptors are pooled by Eq. (4). When training the STNet, the batch size and learning rate are set as fixed hyper-parameters.
We compare STNet with the baseline architectures Surf-O1, Surf-O2, Surf-O1-ML and Surf-O2-ML in Table 5. For the STNet, we perform 5-fold cross-validation 6 times (each time using a different train/test split), and the average accuracy is reported with the standard deviation shown in brackets. For the classification task, the descriptors are sent to the classification loss (see Sect. 3.4) for classifier training. The raw descriptors using average pooling, i.e., Surf-O1, achieve 85.58% classification accuracy. Our STNet achieves 97.37% accuracy, which shows the effectiveness of our network architecture.
In Table 5, we also present the classification accuracies of the deep learning-based methods DeepGM [26] and PointNet++ [36]. DeepGM [26] learns deep features from geodesic moments by a stacked autoencoder. PointNet++ [36] is pioneering work on deep learning for point clouds, applying a well-designed PointNet [34] recursively on a nested partition of the input point set. Compared with the state-of-the-art PointNet++ [36], the classification accuracy of our STNet is higher.
Table 5: Classification accuracy (%) on the SHREC’15 dataset.

| Method | DeepGM [26] | PointNet++ [36] | Surf-O1 | Surf-O2 | Surf-O1-ML | Surf-O2-ML | STNet |
|---|---|---|---|---|---|---|---|
| Acc | 93.03 | 96.09 | 85.58 | 89.84 | 91.00 | 93.08 | 97.37 (0.97) |
4.5 Evaluation of the SPDM-manifold transform
SPDM-T is an essential stage in our STNet. Besides the analysis in Sect. 3.3, we evaluate and visualize the learned SPDM-manifold transform in this subsection.
We first quantitatively evaluate the effects of different transforms in the SPDM-T stage on the SHREC’14 Real dataset in setting-3. These transforms include the power-Euclidean transform (pE) [10] and the logarithm-based transforms: log-Euclidean (LE) [10] and its regularized variants (LR) [17] and (LMR) [15]. Besides the transforms mentioned above, we also present the results of normalization alone (N) and signed square-root followed by normalization (SSN) [24]. The compared results in Table 6 are produced by the STNet with the SPDM-T stage fixed to these transforms; the results are measured by NN and 1T. Our learned transform achieves the best results on the test set. Some of the compared transforms, such as LE, LR and LMR, perform well on the training set but much worse on the test set; our proposed transform prevents such overfitting.
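The stability issue motivating the power-Euclidean choice is easy to see on a toy spectrum with a near-zero eigenvalue; the eigenvalues, the power index and the regularizer ε below are illustrative values, not those used in the paper.

```python
import numpy as np

# Eigenvalues of an SPD matrix with a near-zero entry, a common case for
# pooled second-order descriptors.
eigvals = np.array([2.0, 0.5, 1e-9])

power_euclidean = eigvals ** 0.5        # pE: bounded, well-defined at zero
log_euclidean = np.log(eigvals)         # LE: diverges as eigenvalues -> 0
log_reg = np.log(eigvals + 1e-6)        # LR-style: finite, but needs a hand-tuned eps
```

The power transform stays bounded however small the eigenvalue, whereas the raw logarithm blows up and the regularized logarithm trades stability for an extra constant that must be tuned per dataset.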
We then visually show the learned transform function in Fig. 3. We draw the curves of the learned g(·) and show examples of pooled second-order descriptors before and after the transform on the SHREC’14 Real dataset for retrieval (Fig. 3(a)) and the SHREC’15 dataset for classification (Fig. 3(b)). As shown by the curves, the learned g(·) increases the eigenvalues of the pooled second-order descriptors, and increases the smaller eigenvalues more. According to the analysis in Sect. 3.3, the learned transform increases the entropy of the distribution of input raw descriptors, resulting in more discriminative shape descriptors after metric learning, as shown in the experiments. As each subfigure of Fig. 3 illustrates, in contrast to traditional fixed transforms, our network can adaptively learn transforms for different tasks by discriminative learning. We also show the pooled second-order descriptors before (upper-right images) and after (lower-right images) the transform in the subfigures; the values around the diagonal elements are enhanced after the transform.
Table 6: Comparison of fixed spectral transforms and the proposed learned transform (%) on the SHREC’14 Real dataset (setting-3).

| Metric | | LE [10] | LR [17] | LMR [15] | N | SSN [24] | pE [10] | Proposed |
|---|---|---|---|---|---|---|---|---|
| NN | train | 100 | 93.33 | 95.00 | 95.83 | 97.50 | 99.17 | 98.33 |
| | test | 61.07 | 65.00 | 65.36 | 82.14 | 84.64 | 88.57 | 92.50 |
| 1T | train | 93.15 | 82.04 | 83.61 | 87.69 | 95.37 | 96.48 | 96.67 |
| | test | 53.17 | 49.40 | 50.36 | 70.20 | 75.32 | 77.30 | 85.20 |
5 Conclusion
In this paper, we proposed a novel spectral transform network for 3D shape analysis based on surface second-order pooling and a spectral transform on the SPDM-manifold. The network is simple and shallow. Extensive experiments on benchmark datasets show that it can significantly boost the discriminative ability of input shape descriptors and generate discriminative global shape descriptors, achieving or matching state-of-the-art results for non-rigid shape retrieval and classification on diverse benchmark datasets.
In future work, we are interested in designing an end-to-end learning framework that also includes raw descriptor extraction on 3D meshes or point clouds. Furthermore, the surface second-order pooling stage, the SPDM-T stage, and the fully connected layer could be packed into a single block, and multiple such blocks stacked to build a deeper architecture.
Acknowledgement. This work is supported by the National Natural Science Foundation of China under Grants 11622106, 61711530242, 61472313, 11690011, and 61721002.
Footnotes
 email: yuruixuan123@stu.xjtu.edu.cn,{jiansun,huibinli}@xjtu.edu.cn
 http://www.cs.cf.ac.uk/shaperetrieval/download.php
References
 Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J. Matrix Anal. Appl. 29(1), 328–347 (2007)
 Aubry, M., Schlickewei, U., Cremers, D.: The wave kernel signature: A quantum mechanical approach to shape analysis. In: ICCV. pp. 1626–1633 (2011)
 Bai, S., Bai, X., Zhou, Z., Zhang, Z., Tian, Q., Latecki, L.J.: GIFT: Towards scalable 3d shape retrieval. IEEE Transactions on Multimedia 19(6), 1257–1271 (2017)
 Belongie, S., Malik, J., Puzicha, J.: Shape context: A new descriptor for shape matching and object recognition. In: NIPS. pp. 831–837 (2001)
 Boscaini, D., Masci, J., Rodolà, E., Bronstein, M.: Learning shape correspondence with anisotropic convolutional neural networks. In: NIPS. pp. 3189–3197 (2016)
 Bronstein, M.M., Kokkinos, I.: Scale-invariant heat kernel signatures for non-rigid shape recognition. In: CVPR. pp. 1704–1711 (2010)
 Carreira, J., Caseiro, R., Batista, J., Sminchisescu, C.: Semantic segmentation with second-order pooling. ECCV pp. 430–443 (2012)
 Chiotellis, I., Triebel, R., Windheuser, T., Cremers, D.: Non-rigid 3d shape retrieval via large margin nearest neighbor embedding. In: ECCV. pp. 327–342 (2016)
 Cirujeda, P., Mateo, X., Dicente, Y., Binefa, X.: Mcov: A covariance descriptor for fusion of texture and shape features in 3d point clouds. In: 3DV. pp. 551–558 (2015)
 Dryden, I.L., Koloydenko, A., Zhou, D.: Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. The Annals of Applied Statistics 3(3), 1102–1123 (2009)
 Elad, A., Kimmel, R.: On bending invariant signatures for surfaces. IEEE TPAMI 25(10), 1285–1295 (2003)
 Fang, Y., Xie, J., Dai, G., Wang, M., Zhu, F., Xu, T., Wong, E.: 3d deep shape descriptor. In: CVPR. pp. 2319–2328 (2015)
 Furuya, T., Ohbuchi, R.: Deep aggregation of local 3d geometric features for 3d model retrieval. In: BMVC. pp. 121.1–121.12 (2016)
 Gasparetto, A., Torsello, A.: A statistical model of riemannian metric variation for deformable shape analysis. In: CVPR. pp. 1219–1228 (2015)
 Huang, Z., Van Gool, L.: A riemannian network for spd matrix learning. In: AAAI (2017)
 Ionescu, C., Carreira, J., Sminchisescu, C.: Iterated second-order label sensitive pooling for 3d human pose estimation. In: CVPR. pp. 1661–1668 (2014)
 Ionescu, C., Vantzos, O., Sminchisescu, C.: Matrix backpropagation for deep networks with structured layers. In: ICCV. pp. 2965–2973 (2015)
 Jegou, H., Douze, M., Schmid, C., Perez, P.: Aggregating local descriptors into a compact image representation. In: CVPR. pp. 3304–3311 (2010)
 Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. IEEE TPAMI 21(5), 433–449 (1999)
 Lafon, S., Keller, Y., Coifman, R.R.: Data fusion and multicue data matching by diffusion maps. IEEE TPAMI 28(11), 1784–1797 (2006)
 Li, B., Lu, Y., Li, C., et al.: A comparison of 3d shape retrieval methods based on a large-scale benchmark supporting multimodal queries. CVIU 131(C), 1–27 (2015)
 Li, H., Huang, D., Morvan, J., Wang, Y., Chen, L.: Towards 3d face recognition in the real: A registration-free approach using fine-grained matching of 3d keypoint descriptors. IJCV 113(2), 128–142 (2015)
 Lian, Z., Zhang, J., et al.: Shrec'15 track: Non-rigid 3d shape retrieval. Eurographics 3DOR Workshop (2015)
 Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear cnn models for fine-grained visual recognition. In: ICCV. pp. 1449–1457 (2015)
 Litman, R., Bronstein, A., Bronstein, M., Castellani, U.: Supervised learning of bag-of-features shape descriptors using sparse coding. CGF 33(5), 127–136 (2014)
 Luciano, L., Hamza, A.B.: Deep learning with geodesic moments for 3d shape classification. Pattern Recognition Letters (In press) (2017)
 Masci, J., Boscaini, D., Bronstein, M., Vandergheynst, P.: Geodesic convolutional neural networks on riemannian manifolds. In: ICCV. pp. 832–840 (2015)
 Masoumi, M., Rodola, E., Cosmo, L.: Shrec’17 track: Deformable shape retrieval with missing parts. In: Eurographics 3DOR Workshop (2017)
 Ohkita, Y., Ohishi, Y., Furuya, T., Ohbuchi, R.: Non-rigid 3d model retrieval using set of local statistical features. In: IEEE International Conference on Multimedia and Expo Workshops. pp. 593–598 (2012)
 Osada, R., Funkhouser, T., Chazelle, B., Dobkin, D.: Shape distributions. ACM TOG 21(4), 807–832 (2002)
 Pennec, X., Fillard, P., Ayache, N.: A riemannian framework for tensor computing. IJCV 66(1), 41–66 (2006)
 Pickup, D., Sun, X., Rosin, P.L., et al.: Shrec'14 track: shape retrieval of non-rigid 3d human models. In: Eurographics 3DOR Workshop (2014)
 Pinkall, U., Polthier, K.: Computing discrete minimal surfaces and their conjugates. Experimental Mathematics 2(1), 15–36 (1993)
 Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. CVPR, IEEE (2017)
 Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multiview cnns for object classification on 3d data. In: CVPR. pp. 5648–5656 (2016)
 Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: NIPS, pp. 5099–5108 (2017)
 Reuter, M., Wolter, F., Peinecke, N.: Laplace-Beltrami spectra as 'Shape-DNA' of surfaces and solids. Computer-Aided Design 38(4), 342–366 (2006)
 Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The princeton shape benchmark. In: IEEE International Conference on Shape Modeling and Applications. pp. 167–178 (2004)
 Smeets, D., Keustermans, J., Vandermeulen, D., Suetens, P.: meshSIFT: Local surface features for 3d face recognition under expression variations and partial data. CVIU 117(2), 158–169 (2013)
 Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.G.: Multi-view convolutional neural networks for 3d shape recognition. ICCV pp. 945–953 (2015)
 Sun, J., Ovsjanikov, M., Guibas, L.: A concise and provably informative multiscale signature based on heat diffusion. In: CGF. pp. 1383–1392 (2009)
 Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CVPR pp. 1–9 (2015)
 Tabia, H., Laga, H., Picard, D., Gosselin, P.H.: Covariance descriptors for 3d shape matching and retrieval. In: CVPR. pp. 4185–4192 (2014)
 Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: CVPR. pp. 1912–1920 (2015)
 Zhu, F., Xie, J., Fang, Y.: Heat diffusion long-short term memory learning for 3d shape analysis. In: ECCV. pp. 305–321 (2016)