SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters
Abstract
Deep neural networks have enjoyed remarkable success in various vision tasks; however, it remains challenging to apply CNNs to domains lacking a regular underlying structure, such as 3D point clouds. To address this we propose a novel convolutional architecture, termed SpiderCNN, to efficiently extract geometric features from point clouds. SpiderCNN is comprised of units called SpiderConv, which extend convolutional operations from regular grids to irregular point sets that can be embedded in $\mathbb{R}^n$, by parametrizing a family of convolutional filters. We carefully design the filter as a product of a simple step function that captures local geodesic information and a Taylor polynomial that ensures expressiveness. SpiderCNN inherits the multi-scale hierarchical architecture from classical CNNs, which allows it to extract semantic deep features. Experiments on ModelNet40 demonstrate that SpiderCNN achieves state-of-the-art accuracy on standard benchmarks, and shows competitive performance on segmentation tasks.
Keywords:
Convolutional neural networks, parametrized convolutional filters, point clouds

1 Introduction
Footnotes: We will release the source code on GitHub soon. These two authors contributed equally. This work was done during the first author's internship at Shenzhen Institutes of Advanced Technology, CAS. Corresponding author.
Convolutional neural networks are powerful tools for analyzing data that can naturally be represented as signals on regular grids, such as audio and images [1]. Thanks to the translation invariance of lattices in $\mathbb{R}^n$, the number of parameters in a convolutional layer is independent of the input size. Composing convolution layers and activation functions results in a multi-scale hierarchical learning pattern, which has proven very effective for learning deep representations in practice.
With the recent proliferation of applications employing 3D depth sensors [2], such as autonomous navigation, robotics and virtual reality, there is an increasing demand for algorithms that efficiently analyze point clouds. However, point clouds are distributed irregularly in $\mathbb{R}^3$, lacking a canonical order and translation invariance, which prohibits using CNNs directly. One may circumvent this problem by converting point clouds to 3D voxels and applying 3D convolutions [3]. However, volumetric methods are computationally inefficient because point clouds are sparse in 3D, as they usually represent 2D surfaces. Although there are studies that improve the computational complexity, this may come with a performance trade-off [4,5]. Various studies are devoted to making convolutional neural networks applicable for learning on non-Euclidean domains such as graphs or manifolds by trying to generalize the definition of convolution to functions on manifolds or graphs, enriching the emerging field of geometric deep learning [6]. However, this is theoretically challenging because convolution cannot be naturally defined when the space does not carry a group action, and when the input data consists of different shapes or graphs, it is difficult to choose convolutional filters, since there is no canonical choice of a domain for them.
In light of the above challenges, we propose an alternative convolutional architecture, SpiderCNN, which is designed to directly extract features from point clouds, and validate its effectiveness on classification and segmentation benchmarks. One of our motivations comes from the continuous and parametric form of filters used in signal processing. By discretizing the integral formula of convolution, as shown in Figure 1, and using a special family of parametrized nonlinear functions on $\mathbb{R}^3$ as filters, we introduce a novel convolutional layer, SpiderConv, for point clouds.
The family of filters is designed to be expressive while still being feasible to optimize. We combine simple step functions, which capture the coarse geometry described by local geodesic distance, with order-3 Taylor expansions, which ensure the filters are complex enough to capture delicate local geometric variations. Experiments in Section 4 show that SpiderCNN with a relatively simple network architecture achieves state-of-the-art performance for classification on ModelNet40 [7], and shows competitive performance for segmentation on ShapeNet-Part [7].
2 Related Work
First we discuss deep-neural-network-based approaches that target point cloud data. Second, we give a partial overview of geometric deep learning.
2.1 Point clouds as input
PointNet [8] is a pioneering work in using deep networks to directly process point sets. A spatial encoding of each point is learned through a shared MLP, and then all individual point features are aggregated into a global signature through max-pooling, a symmetric operation that does not depend on the order of the input point sequence.
While PointNet works well to extract global features, its design limits its efficacy at encoding local structures. Various studies addressing this issue propose different grouping strategies of local features in order to mimic the hierarchical learning procedure at the core of classical convolutional neural networks. PointNet++ [9] uses iterative farthest-point sampling to select centroids of local regions, and PointNet to learn the local pattern. Kd-Network [10] subdivides the space using Kd-trees, whose hierarchical structure serves as the instruction to aggregate local features at different scales. In SpiderCNN, no additional choice for grouping or sampling is needed, as our filters handle the issue automatically.
The idea of using permutation-invariant functions for learning on unordered sets is further explored by DeepSets [11]. We note that the output of SpiderCNN does not depend on the input order by design.
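The permutation invariance that PointNet-style architectures and SpiderCNN rely on can be verified directly: a per-point transform with shared weights followed by a symmetric aggregation such as max-pooling yields the same output under any reordering of the input. A minimal NumPy sketch (the single linear map standing in for a shared MLP is our own simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

def global_feature(points, weights):
    """Per-point transform with shared weights (standing in for a shared
    MLP), followed by max-pooling over the point dimension."""
    per_point = np.maximum(points @ weights, 0.0)  # shared linear map + ReLU
    return per_point.max(axis=0)                   # symmetric aggregation

points = rng.normal(size=(16, 3))   # a tiny toy point cloud
weights = rng.normal(size=(3, 8))

f1 = global_feature(points, weights)
f2 = global_feature(points[rng.permutation(16)], weights)
assert np.allclose(f1, f2)  # the global feature ignores point order
```

Any symmetric reduction (sum, mean, top-k) in place of the max gives the same invariance property.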
2.2 Voxels as input
VoxNet [3] and Voxception-ResNet [5] apply 3D convolution to a voxelization of point clouds. However, there is a high computational and memory cost associated with 3D convolutions. A variety of work [4,12,13] has aimed at exploiting the sparsity of voxelized point clouds to improve computational and memory efficiency. OctNet [4] modified and implemented convolution operations to suit a hybrid grid-octree data structure. Vote3Deep [12] uses a feature-centric voting scheme so that the computational cost is proportional to the number of points with non-zero features. Sparse Submanifold CNN [13] computes the convolution only at activated points, whose number does not increase when convolution layers are stacked. In comparison, SpiderCNN can use point clouds as input directly and can handle very sparse input (see Section 5.1).
2.3 Convolution on non-Euclidean domains
There are two main, philosophically different approaches to defining convolution on non-Euclidean domains: one is spatial and the other is spectral. Our method is fundamentally different from both, since we use the embedding of point sets in $\mathbb{R}^3$ instead of finding a local parametrization of the domain or using the Fourier transform. Meanwhile, by embedding Riemannian manifolds in $\mathbb{R}^n$ [14], or using graph embedding techniques [15], our method offers an alternative approach in this new field.
2.3.1 Spatial methods
GeodesicCNN [16] is an early attempt at applying neural networks to shape analysis. The philosophy behind GeodesicCNN is that for a Riemannian manifold, the exponential map identifies a local neighborhood of a point with a ball in the tangent space centered at the origin. The tangent plane is isomorphic to $\mathbb{R}^2$, where we know how to define convolution.
Let $M$ be a mesh surface, and let $f: M \to \mathbb{R}$ be a function. GeodesicCNN first uses a patch operator $D$ to map a point $p$ and its neighbors onto the lattice $\mathbb{Z}^2$, and then applies Equation 3. Explicitly,

$$f * W(p) = \sum_{j \in \mathbb{Z}^2} W(j)\,\big(D(p)f\big)(j). \tag{1}$$
2.3.2 Spectral methods
We know that the Fourier transform takes convolutions to multiplications. Explicitly, if $f, g \in L^2(\mathbb{R}^n)$, then $\widehat{f * g} = \hat{f} \cdot \hat{g}$. Therefore, formally we have $f * g = (\hat{f} \cdot \hat{g})^{\vee}$ (where $\hat{f}$ denotes the Fourier transform of a function $f$, and $f^{\vee}$ its inverse Fourier transform), which can be used as a definition of convolution on non-Euclidean domains where we know how to take the Fourier transform.
Although we do not have a Fourier theory on a general space without any equivariant structure, on Riemannian manifolds or graphs there are generalized notions of the Laplacian operator, and taking the Fourier transform in $\mathbb{R}^n$ can be formally viewed as finding the coefficients in the expansion over the eigenfunctions of the Laplacian operator. To be more precise, recall that

$$\hat{f}(\xi) = \int_{\mathbb{R}^n} f(x)\, e^{-2\pi i\, x \cdot \xi}\, dx, \tag{2}$$

and the functions $x \mapsto e^{2\pi i\, x \cdot \xi}$ are eigenfunctions of the Laplacian operator $\Delta$.
Therefore, if $U$ is the matrix whose columns are the eigenvectors of the graph Laplacian matrix and $\Lambda$ is the vector of corresponding eigenvalues, then for two functions $f, g$ on the vertices of the graph we can define $f * g = U\big((U^T f) \odot (U^T g)\big)$, where $U^T$ is the transpose of $U$ and $\odot$ is the Hadamard product. Since being compactly supported in the spatial domain translates into smoothness in the spectral domain, it is natural to choose the spectral filter to be a smooth function of $\Lambda$. For instance, ChebNet [19] uses Chebyshev polynomials, which reduce the complexity of filtering, and CayleyNet [20] uses Cayley polynomials, which allow efficient computation of localized filters in restricted frequency bands of interest.
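The spectral definition of graph convolution can be illustrated on a toy graph; the path graph and the particular signals below are our own choices for demonstration:

```python
import numpy as np

# A path graph on 5 vertices: Laplacian L = D - A.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Columns of U are the eigenvectors of L: the "graph Fourier modes".
eigvals, U = np.linalg.eigh(L)

f = np.sin(np.arange(5.0))  # two arbitrary signals on the vertices
g = np.cos(np.arange(5.0))

# Spectral convolution: map to the spectral domain with U^T,
# multiply pointwise (Hadamard product), and map back with U.
conv = U @ ((U.T @ f) * (U.T @ g))

# Commutativity of the convolution follows from the pointwise product.
assert np.allclose(conv, U @ ((U.T @ g) * (U.T @ f)))
```

Note that the eigenbasis `U` depends on the particular graph, which is exactly the transferability problem discussed next.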
When analyzing different graphs or shapes, spectral methods do not transfer directly, because different spectral domains cannot be canonically identified. SyncSpecCNN [21] proposes a weight-sharing scheme that aligns spectral domains using functional maps. In contrast, SpiderCNN uses the embedding information of 3D point clouds, and avoids such alignment problems.
3 SpiderConv
In this section, we describe SpiderConv, which is the fundamental building block for SpiderCNN. First, we discuss how to define a convolutional layer in neural network when the inputs are features on point sets in . Next we introduce a special family of convolutional filters. Finally, we give details for the implementation of SpiderConv with multiple channels and the approximations used for computational speedup.
3.1 Convolution on point sets in $\mathbb{R}^3$
An image is a function $F$ on the regular grid $\mathbb{Z}^2$. Let $W$ be a $(2m+1) \times (2m+1)$ filter matrix, where $m$ is a positive integer; the convolution in classical CNNs is

$$F * W(i, j) = \sum_{s=-m}^{m} \sum_{t=-m}^{m} F(i - s,\, j - t)\, W(s, t), \tag{3}$$

which is the discretization of the following integral

$$f * g(p) = \int_{\mathbb{R}^2} f(q)\, g(p - q)\, dq, \tag{4}$$

if $f$ and $g$ are functions on $\mathbb{R}^2$ such that $f(i, j) = F(i, j)$ for $(i, j) \in \mathbb{Z}^2$, $g(s, t) = W(s, t)$ for $s, t \in \{-m, \dots, m\}$, and $g$ is supported in $[-m, m]^2$.
Now suppose that $F$ is a function on a set of points $P$ in $\mathbb{R}^3$. Let $g$ be a filter supported in a ball centered at the origin with radius $r$. It is natural to define SpiderConv with input $F$ and filter $g$ as follows:

$$F * g(p) = \sum_{q \in P,\; \|q - p\| \le r} F(q)\, g(p - q). \tag{5}$$
Note that when $P$ is a regular grid, Equation 5 reduces to the discrete convolution of Equation 3. Thus the classical convolution can be seen as a special case of SpiderConv. Please see Figure 1 for an intuitive illustration.
In SpiderConv, the filters are chosen from a parametrized family $\{g_w\}$ which is piecewise differentiable in the parameters $w$. During the training of SpiderCNN, the parameters $w$ are optimized through the SGD algorithm, and the gradients are computed through the formula below:

$$\frac{\partial}{\partial w_i}\, F * g_w(p) = \sum_{q \in P,\; \|q - p\| \le r} F(q)\, \frac{\partial g_w}{\partial w_i}(p - q), \tag{6}$$

where $w_i$ is the $i$-th component of $w$.
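A direct, unoptimized reading of Equation 5 fits in a few lines of NumPy; the filter `g` can be any function $\mathbb{R}^3 \to \mathbb{R}$, and the brute-force neighbor search and the Gaussian-like example filter are for illustration only:

```python
import numpy as np

def spider_conv(points, features, g, radius):
    """Naive SpiderConv (Equation 5): for each point p, sum the features of
    all points q within `radius` of p, weighted by the filter g(p - q)."""
    out = np.zeros(len(points))
    for i, p in enumerate(points):
        for q, f in zip(points, features):
            if np.linalg.norm(q - p) <= radius:
                out[i] += f * g(p - q)
    return out

# Example with a simple Gaussian-like filter (our own choice of g).
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
feats = np.array([1.0, 2.0, 3.0])
out = spider_conv(pts, feats, lambda d: np.exp(-d @ d), radius=2.0)
```

With a constant filter and a large radius, every output entry is simply the sum of all features, which is a quick sanity check of the formula.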
3.2 A special family of filters
A natural choice is to take $g_w$ to be a multi-layer perceptron (MLP), because theoretically an MLP with one hidden layer can approximate an arbitrary continuous function [22]. However, in practice we find that this idea does not work well (see the Appendix for empirical evidence). One possible reason is that an MLP fails to account for the geometric prior of 3D point clouds; another is that to ensure sufficient expressiveness the number of parameters in an MLP needs to be sufficiently large, which makes the optimization problem intractable.
To address the above issues, we propose the following family of filters $\{g_w\}$:

$$g_w(x, y, z) = g^{step}_{w^s}(x, y, z) \cdot g^{Taylor}_{w^t}(x, y, z), \tag{7}$$

where $w = (w^s, w^t)$ is the concatenation of the two vectors $w^s$ and $w^t$ (we write $w^s_i$ for the $i$-th component of the vector $w^s$), where

$$g^{step}_{w^s}(x, y, z) = w^s_i \quad \text{if } r_i \le \sqrt{x^2 + y^2 + z^2} < r_{i+1}, \tag{8}$$

with $0 = r_0 < r_1 < \dots < r_N$, and

$$g^{Taylor}_{w^t}(x, y, z) = \sum_{i + j + k \le 3} w^t_{ijk}\, x^i y^j z^k. \tag{9}$$
The first component $g^{step}_{w^s}$ is a step function in the radius variable of the local polar coordinates around a point. It encodes local geodesic information, a critical quantity for describing the coarse local shape. Moreover, step functions are relatively easy to optimize using SGD.
The order-3 Taylor term $g^{Taylor}_{w^t}$ further enriches the complexity of the filters, complementary to $g^{step}_{w^s}$, since it also captures variations in the angular component. Let us be more precise about the reason for choosing Taylor expansions here, from the perspective of interpolation. We can think of the classical 2D convolutional filters as a family of functions interpolating given values at the 9 points $(i, j)$ with $i, j \in \{-1, 0, 1\}$, with these values serving as the parametrization of the family. Analogously, in 3D consider the vertices of the cube $\{0, 1\}^3$, and assume that at the vertex $(a, b, c)$ the value $\lambda_{abc}$ is assigned. The trilinear interpolation algorithm gives us a function of the form

$$g^{Tri}(x, y, z) = c_0 + c_1 x + c_2 y + c_3 z + c_4 xy + c_5 yz + c_6 xz + c_7 xyz, \tag{10}$$

where the $c_i$'s are linear functions of the $\lambda_{abc}$'s. Therefore $g^{Tri}$ is a special form of $g^{Taylor}$, and by varying $\lambda_{abc}$, the family $g^{Taylor}$ can interpolate arbitrary values at the vertices of a cube and capture rich spatial information.
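The step-times-Taylor filter family can be written down directly; the bin boundaries and the monomial ordering below are our own illustrative choices:

```python
import numpy as np
from itertools import product

def step_part(d, ws, radii):
    """Piecewise-constant radial term (Equation 8): returns ws[i] when the
    norm of d falls in [radii[i], radii[i+1])."""
    r = np.linalg.norm(d)
    idx = np.clip(np.searchsorted(radii, r, side="right") - 1, 0, len(ws) - 1)
    return ws[idx]

def taylor_part(d, wt):
    """Order-3 Taylor term (Equation 9): sum of wt_ijk * x^i y^j z^k over
    i + j + k <= 3 (20 monomials; this particular ordering is our own)."""
    x, y, z = d
    monomials = [x**i * y**j * z**k
                 for i, j, k in product(range(4), repeat=3) if i + j + k <= 3]
    return float(np.dot(wt, monomials))

def g(d, ws, radii, wt):
    """The full filter (Equation 7): the product of the two parts."""
    return step_part(d, ws, radii) * taylor_part(d, wt)
```

The step weights `ws` and the 20 Taylor coefficients `wt` together form the parameter vector $w$ that SGD optimizes.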
3.3 Implementation
In this subsection, we give details of our implementations of SpiderConv.
Please note that the following approximations are used, based on the uniform sampling process constructing the point clouds:

- K-nearest neighbors are used to measure locality instead of the radius, so the summation in Equation 5 is taken over the K-nearest neighbors of $p$.

- The step function $g^{step}_{w^s}$ is approximated through the distance ordering of the neighbors. Explicitly, let $(X_1, \dots, X_K)$ be the features indexed by the K-nearest neighbors of $p$ (including $p$ itself), where $X_i$ is the feature at the $i$-th nearest neighbor. Then applying $g^{step}_{w^s}$ is approximated by the weighted sum $\sum_i w_i X_i$, where the weights $w_i$ correspond to the values $w^s_i$ in Equation 8.

Later in the article, we omit the parameters $w^s$ and $w^t$, and just write $g$ to simplify our notation.
The input to SpiderConv is a $c_1$-dimensional feature on a point cloud $P$, represented as a function $F_{in}: P \to \mathbb{R}^{c_1}$ with components $F_{in}(q, i)$ for $1 \le i \le c_1$. The output of a SpiderConv is a $c_2$-dimensional feature on the point cloud, $F_{out}: P \to \mathbb{R}^{c_2}$. Let $p$ be a point in the point cloud, and let $q_1, q_2, \dots, q_K$ be its K-nearest neighbors in order. Assume the Taylor part decomposes as $g^{Taylor}_{w^t} = \sum_{b=1}^{k} w^t_b\, h_b$, where the $h_b$ are the monomials $x^i y^j z^k$. Then a SpiderConv with $c_1$ in-channels, $c_2$ out-channels and $k$ Taylor terms is defined via the following formula:

$$F_{out}(p, t) = \sum_{i=1}^{c_1} \sum_{b=1}^{k} \tilde{F}(p, i, b)\, W(i, b, t), \tag{11}$$

where

$$\tilde{F}(p, i, b) = \sum_{j=1}^{K} g^{step}_{w(i, b)}(q_j - p)\, h_b(q_j - p)\, F_{in}(q_j, i), \tag{12}$$

where $g^{step}_{w(i, b)}$ is in the parametrized family $\{g^{step}_{w^s}\}$ for $1 \le i \le c_1$ and $1 \le b \le k$.
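Putting the pieces together, a multi-channel SpiderConv layer can be sketched as below. This is our own simplified reading (dense loops instead of the batched TensorFlow implementation, a single shared step function instead of one per channel pair, and `basis` standing in for the Taylor monomials):

```python
import numpy as np

def knn(points, p_idx, K):
    """Indices of the K nearest neighbors of points[p_idx] (itself included)."""
    d = np.linalg.norm(points - points[p_idx], axis=1)
    return np.argsort(d)[:K]

def spider_conv_layer(points, F_in, step_w, radii, basis, W, K=5):
    """Dense-loop sketch of a multi-channel SpiderConv.
    F_in:   (N, c1) input features.
    step_w: weights of the shared radial step function.
    basis:  list of k functions R^3 -> R (the Taylor monomial basis).
    W:      (c1, k, c2) learnable mixing weights."""
    N, c1 = F_in.shape
    k, c2 = len(basis), W.shape[2]
    F_out = np.zeros((N, c2))
    for p in range(N):
        F_tilde = np.zeros((c1, k))  # the intermediate feature of Eq. 12
        for j in knn(points, p, K):
            d = points[j] - points[p]
            r = np.linalg.norm(d)
            idx = np.clip(np.searchsorted(radii, r, side="right") - 1,
                          0, len(step_w) - 1)
            for b, h in enumerate(basis):
                F_tilde[:, b] += step_w[idx] * h(d) * F_in[j]
        # Mix intermediate features into out-channels (Eq. 11).
        F_out[p] = np.einsum("ib,ibt->t", F_tilde, W)
    return F_out
```

A batched implementation would gather all neighbor indices up front and express both sums as matrix multiplications.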
4 Experiments
In this section, we conduct experiments on 3D point cloud classification and segmentation to evaluate SpiderCNN. We empirically examine the key hyperparameters of a 3-layer SpiderCNN, and compare with state-of-the-art methods. In all experiments, we implement the models with TensorFlow 1.3 on a 1080Ti GPU. The Adam optimizer is used, and the dropout rate is 0.5. Batch normalization is used at the end of each SpiderConv with decay 0.5.
4.1 Classification on ModelNet40
ModelNet40 [7] contains 12,311 CAD models belonging to 40 different categories, with 9,843 for training and 2,468 for testing. We use the source code for PointNet [8] to sample 1,024 points uniformly and compute the normal vectors from the mesh models. The same data augmentation strategy as [8] is applied: the point cloud is randomly rotated along the up-axis, and the position of each point is jittered by Gaussian noise with zero mean and 0.02 standard deviation. The batch size is 32 for all the experiments in Section 4.1. We use the coordinates and normal vectors of the 1,024 points as the input for SpiderCNN in the experiments on ModelNet40 unless otherwise specified.
4.1.1 3layer SpiderCNN
Figure 3 illustrates a SpiderCNN with 3 layers of SpiderConvs with 3 Taylor terms, whose out-channels are 32, 64 and 128 (see Section 3.3 for the definition of a SpiderConv with given in-channels, out-channels and Taylor terms). The ReLU activation function is used. The output features of the three SpiderConvs are concatenated at the end. Instead of max-pooling, we use top-k pooling to extract richer features.
Two important hyperparameters in SpiderCNN are studied: the number of nearest neighbors chosen in SpiderConv, and the number of pooled features after the concatenation. The results are summarized in Figure 4. The number of nearest neighbors corresponds to the filter size in the usual convolution. We see that 20 is the optimal choice among 12, 16, 20 and 24 nearest neighbors. In Figure 5 we provide visualizations for top-2 pooling: the points that contribute to the top-2 pooled features are plotted. We see that, similar to PointNet, SpiderCNN picks up representative critical points. Note that top-1 pooling is the same as max-pooling. Compared to max-pooling, top-2 pooling enables the model to learn richer geometric information. For example, in Figure 6, we see that top-2 pooling preserves more points where the curvature is non-zero.
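Top-k pooling generalizes max-pooling by keeping the k largest responses per channel; averaging them, as in the sketch below, is one simple reading of the operation (the reduction over the kept values is our own assumption):

```python
import numpy as np

def topk_pool(features, k=2):
    """Top-k pooling: for each channel, keep the k largest values over the
    point dimension and average them (k=1 reduces to plain max-pooling)."""
    sorted_feats = np.sort(features, axis=0)  # ascending along points
    return sorted_feats[-k:].mean(axis=0)

x = np.array([[1.0, 5.0],
              [3.0, 2.0],
              [2.0, 4.0]])
assert np.allclose(topk_pool(x, 1), x.max(axis=0))   # top-1 == max-pooling
assert np.allclose(topk_pool(x, 2), [2.5, 4.5])      # mean of two largest
```

Because sorting per channel is symmetric in the points, top-k pooling preserves the permutation invariance of the network.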
4.1.2 SpiderCNN + PointNet
We train a 3-layer SpiderCNN (top-2 pooling and 20 nearest neighbors) and PointNet, each with only the coordinates as input, to predict the classical robust local geometric descriptor FPFH [23] on point clouds in ModelNet40; the training loss of SpiderCNN is substantially lower than that of PointNet. As a result, we believe that a 3-layer SpiderCNN and PointNet are complementary to each other: SpiderCNN is good at learning local geometric features, and PointNet is good at capturing global features. By concatenating the 128-dimensional features from PointNet with the 128-dimensional features from SpiderCNN, we improve the classification accuracy to 92.2% (Table 1).
4.1.3 4layer SpiderCNN
Experiments show that even a 1-layer SpiderCNN with a single 32-channel SpiderConv achieves reasonable classification accuracy, and the performance of SpiderCNN improves with an increasing number of SpiderConv layers. A 4-layer SpiderCNN consists of SpiderConvs with out-channels 32, 64, 128 and 256. Feature concatenation, top-2 pooling and 20 nearest neighbors are used in the same way as for the 3-layer SpiderCNN. To prevent overfitting, we further apply the data augmentation method DP (random input dropout) introduced in [9] during training of the 4-layer SpiderCNN. Table 1 shows the comparison between SpiderCNN and other models. We see that the 4-layer SpiderCNN achieves an accuracy of 92.4%, which improves on the best reported result among models that directly process point clouds.
Method  Input  Accuracy 

Subvolume [24]  voxels  89.2 
VRN Single [5]  voxels  91.3 
OctNet [4]  hybrid grid octree  86.5 
ECC [25]  graphs  87.4 
KdNetwork [10] (depth 15)  1024 points  91.8 
PointNet [8]  1024 points  89.2 
PointNet++ [9]  5000 points+normal  91.9 
SpiderCNN + PointNet  1024 points+normal  92.2 
SpiderCNN (4layer)  1024 points+normal  92.4 
4.2 Classification on SHREC15
SHREC15 is a dataset for non-rigid 3D shape retrieval. It consists of 1,200 watertight triangle meshes divided into 50 categories. On average, 10,000 vertices are stored in one mesh model. Compared to ModelNet40, SHREC15 contains more complicated local geometry and non-rigid deformations of objects. See Figure 7 for a comparison.
1,192 meshes are used, with 895 for training and 297 for testing. We compute three intrinsic shape descriptors (Heat Kernel Signature, Wave Kernel Signature and Fast Point Feature Histograms) for deformable shape analysis from the mesh models. 1,024 points are sampled uniformly at random from the vertices of a mesh model (note that the vertices themselves are not uniformly distributed), and their coordinates are used as the input for SpiderCNN, PointNet and PointNet++. We use an SVM with linear kernel when the inputs are classical shape descriptors. Table 2 summarizes the results. We see that SpiderCNN outperforms the other methods.
Method  Input  Accuracy 

SVM + HKS  features  56.9 
SVM + WKS  features  87.5 
SVM + FPFH  features  80.8 
PointNet  points  69.4 
PointNet++ [9]  points  60.2 
PointNet++(our implementation)  points  94.1 
SpiderCNN (4layer)  points  95.8 
4.3 Segmentation on ShapeNet Parts
ShapeNet Parts consists of 16,880 models from 16 shape categories with 50 different parts in total, split into 14,006 training and 2,874 testing models. Each model is annotated with 2 to 6 parts. The mIoU is used as the evaluation metric, computed by averaging over all part classes.
A 4-layer SpiderCNN, whose architecture is shown in Figure 9, is trained with batch size 16. We use the points with their normal vectors as input and assume that the category labels are known. The results are summarized in Table 3. We see that SpiderCNN achieves competitive results despite a relatively simple network architecture.
Method  mean  aero  bag  cap  car  chair  earphone  guitar  knife  lamp  laptop  motor  mug  pistol  rocket  skateboard  table
PN [8]  83.7  83.4  78.7  82.5  74.9  89.6  73.0  91.5  85.9  80.8  95.3  65.2  93.0  81.2  57.9  72.8  80.6
PN++ [9]  85.1  82.4  79.0  87.7  77.3  90.8  71.8  91.0  85.9  83.7  95.3  71.6  94.1  81.3  58.7  76.4  82.6
Kd-Net [10]  82.3  80.1  74.6  74.3  70.3  88.6  73.5  90.2  87.2  81.0  94.9  57.4  86.7  78.1  51.8  69.9  80.3
SSCNN [21]  84.7  81.6  81.7  81.9  75.2  90.2  74.9  93.0  86.1  84.7  95.6  66.7  92.7  81.6  60.6  82.9  82.1
SpiderCNN  85.3  83.5  81.0  87.2  77.5  90.7  76.8  91.1  87.3  83.3  95.8  70.2  93.5  82.7  59.7  75.8  82.8
5 Analysis
In this section, we conduct additional analysis and evaluations on the robustness of SpiderCNN, and provide visualization for some of the typical learned filters from the first layer of SpiderCNN.
5.1 Robustness
We study the effect of missing points on SpiderCNN. Following the settings of the experiments in Section 4.1, we train a 4-layer SpiderCNN and PointNet++ with 512, 256, 128, 64 and 32 points as input. The results are summarized in Figure 10. We see that SpiderCNN retains a relatively high accuracy even when only 32 points are given as input.
5.2 Visualization
In Figure 11, we scatter-plot the convolutional filters $g_w$ learned in the first layer of SpiderCNN; the color of a point represents the value of $g_w$ at that point.
In Figure 12 we choose a plane passing through the origin, and project the points that lie on one side of the plane of the scatter graph onto the plane. We see some similar patterns that appear in 2D image filters. The visualization gives some hints about the geometric features that the convolutional filters in SpiderCNN learn. For example, the first row in Figure 12 corresponds to 2D image filters that can capture boundary information.
6 Conclusion and Future Work
We proposed a new convolutional neural network, SpiderCNN, that can directly process 3D point clouds with parameterized convolutional filters and learn features in a multi-scale hierarchical way. Experiments on 3D point-cloud classification and segmentation tasks demonstrate the effectiveness of SpiderCNN. SpiderCNN provides a general framework for processing signals whose domains can be embedded in $\mathbb{R}^n$. For future work, we plan to explore more applications of SpiderCNN.
References
 [1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
 [2] Zhang, Z.: Microsoft kinect sensor and its effect. IEEE multimedia 19(2) (2012) 4–10
 [3] Maturana, D., Scherer, S.: Voxnet: A 3d convolutional neural network for realtime object recognition. In: Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, IEEE (2015) 922–928
 [4] Riegler, G., Ulusoy, A.O., Geiger, A.: Octnet: Learning deep 3d representations at high resolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 3. (2017)
 [5] Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236 (2016)
 [6] Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34(4) (2017) 18–42
 [7] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An informationrich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
 [8] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1(2) (2017) 4
 [9] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems. (2017) 5105–5114
 [10] Klokov, R., Lempitsky, V.: Escape from cells: Deep kdnetworks for the recognition of 3d point cloud models. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE (2017) 863–872
 [11] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in Neural Information Processing Systems. (2017) 3394–3404
 [12] Engelcke, M., Rao, D., Wang, D.Z., Tong, C.H., Posner, I.: Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE (2017) 1355–1361
 [13] Graham, B., Engelcke, M., van der Maaten, L.: 3d semantic segmentation with submanifold sparse convolutional networks. arXiv preprint arXiv:1711.10275 (2017)
 [14] Nash, J.: The imbedding problem for riemannian manifolds. Annals of mathematics (1956) 20–63
 [15] Goyal, P., Ferrara, E.: Graph embedding techniques, applications, and performance: A survey. arXiv preprint arXiv:1705.02801 (2017)
 [16] Masci, J., Boscaini, D., Bronstein, M., Vandergheynst, P.: Geodesic convolutional neural networks on riemannian manifolds. In: Proceedings of the IEEE international conference on computer vision workshops. (2015) 37–45
 [17] Boscaini, D., Masci, J., Rodolà, E., Bronstein, M.: Learning shape correspondence with anisotropic convolutional neural networks. In: Advances in Neural Information Processing Systems. (2016) 3189–3197
 [18] Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., Bronstein, M.M.: Geometric deep learning on graphs and manifolds using mixture model cnns. In: Proc. CVPR. Volume 1. (2017) 3
 [19] Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in Neural Information Processing Systems. (2016) 3844–3852
 [20] Levie, R., Monti, F., Bresson, X., Bronstein, M.M.: Cayleynets: Graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664 (2017)
 [21] Yi, L., Su, H., Guo, X., Guibas, L.: Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In: Computer Vision and Pattern Recognition (CVPR). (2017)
 [22] Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural networks 4(2) (1991) 251–257
 [23] Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (fpfh) for 3d registration. In: Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, IEEE (2009) 3212–3217
 [24] Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multiview cnns for object classification on 3d data. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 5648–5656
 [25] Simonovsky, M., Komodakis, N.: Dynamic edgeconditioned filters in convolutional neural networks on graphs. In: Proc. CVPR. (2017)
Appendix A Appendix
In this section, we provide some additional details and experimental results.
A.1 Taylor vs. MLP
Recall that the filter used in SpiderCNN decomposes as $g_w = g^{step}_{w^s} \cdot g^{Taylor}_{w^t}$. We study empirically the effect of replacing the Taylor term with an MLP in a 4-layer SpiderCNN for classification on ModelNet40 (Section 4.1). The results are summarized in Table 4. We see that the Taylor term outperforms MLPs even when the MLPs have more parameters.
Methods  MLP(10, 5, 3)  MLP(6, 5, 3)  MLP(5, 3, 3)  order3 Taylor 

Accuracy(%)  91.6  91.1  91.4  92.4 
A.2 Number of parameters in the Taylor term
Recall that $g^{Taylor}$ is chosen to be the order-3 expansion in SpiderCNN, while the trilinear interpolation of Equation 10 gives a simpler expansion $g^{Tri}$. We study the effect of replacing $g^{Taylor}$ with $g^{Tri}$, the order-2 Taylor expansion, and the linear Taylor expansion in a 4-layer SpiderCNN for classification on ModelNet40. The results are summarized in Table 5.
Method  order-3 Taylor  order-2 Taylor  linear Taylor  $g^{Tri}$ 

Accuracy(%)  92.4  91.9  91.6  91.9 
One justification for using the order-3 Taylor expansion instead of $g^{Tri}$ is that if we compose $g^{Tri}$ with a rigid transformation of $\mathbb{R}^3$, terms outside the trilinear family will appear. For instance, under a rotation by 45° about the $z$-axis, the expression $xyz$ becomes $\frac{1}{2}(x^2 - y^2)z$, which contains the monomial $x^2 z$ that is absent from $g^{Tri}$; the full order-3 family, in contrast, is closed under rigid transformations.
A.3 SpiderCNN + PointNet
Figure 13 shows the architecture combining a 3-layer SpiderCNN with PointNet in Section 4.1. Table 6 summarizes the classification results on ModelNet40. We see that the combined model outperforms both the 3-layer SpiderCNN and PointNet.
Methods  3layer SpiderCNN  PointNet  SpiderCNN + PointNet 

Accuracy(%)  91.5  90.3  92.2 
A.4 Using an MLP as the family of filters
It seems natural to choose an MLP as the parameterized family of filters $\{g_w\}$. However, in practice we did not find a way to optimize such a family effectively.
A SpiderConv(MLP) with $c$ in-channels is defined as follows:

$$F_{out}(p) = \sum_{i=1}^{c} \sum_{q \in \mathrm{KNN}(p)} F_{in}(q, i)\, g_i(p - q),$$

where $F_{in}(\cdot, i)$ is the input feature from the $i$-th channel, each $g_i$ is an MLP, and $F_{out}$ is the output feature.
We implement a SpiderCNN with 2 layers of SpiderConv(MLP(10, 5, 1)), where the channels are 32 and 64. We also implement a 2-layer SpiderCNN with the SpiderConv described in Section 3.3, with 3 Taylor terms and channels 32 and 64. Batch normalization, ReLU nonlinearity, 20 nearest neighbors and max-pooling are used in both models for a fair comparison. Table 7 shows the experimental results. We see that the MLP filters perform poorly due to the difficulty of optimization.
Methods  MLP(10, 5, 1)  SpiderConv (3 Taylor terms) 

Test Accuracy(%)  20.3  87.0 
Training Accuracy(%)  14.9  87.2 