Cross-Shape Graph Convolutional Networks
Abstract
We present a method that processes 3D point clouds by performing graph convolution operations across shapes. In this manner, point descriptors are learned by allowing interaction and propagation of feature representations within a shape collection. To enable this form of non-local, cross-shape graph convolution, our method learns a pairwise point attention mechanism indicating the degree of interaction between points on different shapes. Our method also learns to create a graph over shapes of an input collection whose edges connect shapes deemed useful for performing cross-shape convolution. The edges are also equipped with learned weights indicating the compatibility of each shape pair for cross-shape convolution. Our experiments demonstrate that this interaction and propagation of point representations across shapes makes them more discriminative. In particular, our results show significantly improved performance for 3D point cloud semantic segmentation compared to conventional approaches, especially in cases with a limited number of training examples.
Keywords:
geometric deep learning, 3D point clouds, shape segmentation, cross-shape convolution, cross-attention.
1 Introduction
Learning geometric representations is fundamental to shape understanding and processing. In recent years, there has been significant research in developing deep networks that operate directly on 3D point clouds. Inspired by advances of deep learning on graphs, several architectures have been proposed to learn pointwise representations of shapes through graph convolution and attention layers [1, 2, 3, 4, 5, 6, 7]. The layers of these networks output a representation for each point by weighting and aggregating representations and relations with other points within the same input point set. In this manner, these networks hierarchically encode shape structure useful for performing high-level tasks, such as shape segmentation.
In this paper, we propose to extend graph convolution and attention to operate across shapes of an input collection (Figure 1). In our architecture, the representation of a point in a shape is learned by combining representations originating from points in the same shape as well as other shapes. The rationale for such an approach is that if a point on one shape is related to a point on another shape, e.g., they lie on geometrically or semantically similar patches or parts, then cross-shape convolution can promote consistency in their resulting representations and part label assignments. Our approach is also inspired by early shape segmentation approaches that transfer deformable part templates across shapes by alternating between estimating point correspondences and alignment [8, 9]. In our case, we do not estimate correspondences or alignment explicitly, and we do not rely on hand-engineered part templates or pairwise terms. We instead leverage graph attention to determine and weigh pairs of points on different shapes. We integrate these weights in our cross-shape convolution scheme to hierarchically learn point representations.
Developing such a cross-shape convolution approach poses a number of technical challenges. First, performing graph convolution across all pairs of points and all pairs of shapes becomes prohibitively expensive for large input collections of shapes. Our architecture learns global shape descriptors along with a pairwise shape compatibility function that allows us to efficiently select a set of candidate shapes and assess their usefulness for cross-shape convolution for each input shape. For example, given an input office chair, it is more useful to allow interactions of its points with points of another office chair rather than a stool. Furthermore, given a point on a shape, its interactions with other points of another shape are not equally important. We incorporate a cross-shape attention function that predicts the degree of interaction between pairs of points on different shapes. Another challenge is that training the above attention function requires discovering shapes useful for cross-shape convolution in the first place. To this end, during training, we maintain a sparse graph (called “shape collection graph”, Figure 1), whose nodes represent training shapes and whose edges specify which pairs of shapes should interact for cross-shape convolution. During training, the graph edges are dynamically updated according to the learned pairwise shape compatibility function such that increasingly more informative shapes are selected for cross-shape convolution for each training shape. At test time, the shape collection graph is augmented with additional nodes representing test shapes (Figure 1). New edges are added connecting them to training shapes for propagating representations from relevant training shapes through our learned cross-shape convolution.
Our architecture integrates the popular DGCNN network [1] as backbone. We extended it with our cross-shape convolution layers and trained our new architecture end-to-end for semantic shape segmentation. Our experiments indicate significantly higher performance on the recent PartNet dataset [10]. Compared to our backbone, we found an improvement of 3.1 points in part IoU on average in PartNet for fine-grained shape segmentation (from 38.4 to 41.5, Table 1). In particular, we found that the improvement in part IoU over our backbone is larger on average for PartNet categories with a limited number of training shapes, demonstrating the utility of our cross-shape convolution scheme especially in such scenarios of limited training data.
2 Related work
We briefly overview related work on 3D deep learning for point clouds. We also discuss cross-attention networks developed in other domains.
3D deep learning for processing point clouds.
Several different types of neural networks have been proposed for processing point sets in recent years. After the pioneering work of PointNet [11, 12], several works further investigated hierarchical point aggregation mechanisms to better model the spatial distribution of points [13, 14, 15]. Alternatively, point clouds can be projected onto local views [16, 17, 18, 19] and processed as regular grids through image-based convolutional networks. Another line of work converts point representations into volumetric grids [20, 21, 22, 23, 24, 25] and processes them through 3D convolutions. Instead of uniform grids, hierarchical space partitioning structures (e.g., kd-trees, lattices) can be used to define regular convolutions [26, 27, 28, 29, 30]. Another type of network incorporates pointwise convolution operators to directly process point clouds [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. Alternatively, shapes can be treated as graphs by connecting each point to other points within neighborhoods in a feature space. Then graph convolution and pooling operations can be performed either in the spatial domain [1, 43, 44, 45, 46, 2, 47, 3, 4, 5, 6, 48, 15], or in the spectral domain [49, 50, 51, 52]. Attention mechanisms have also been investigated to modulate the importance of graph edges and pointwise convolutions [33, 5, 6, 7]. Finally, graph neural network approaches have been shown to model non-local interactions between points within the same shape [1, 3, 5, 48].
None of the above approaches have investigated the possibility of extending convolution or attention across shapes in a collection. Our work shows that these cross-shape operations are not only possible, but also provide significant improvements over conventional approaches.
Cross-attention in other domains.
Our method is inspired by recent cross-attention models proposed for video classification, image classification, keypoint recognition, and image-text matching. Wang et al. [53] introduced non-local networks that allow any image query position to perceive features of all the other positions within the same image or across frames in a video. To avoid huge attention maps, Huang et al. [54] propose a “criss-cross” attention module that maintains sparse connections for each position in image feature maps. Cao et al. [55] simplify non-local blocks with query-independent attention maps. Lee et al. [56] propose cross-attention between text and images to discover latent alignments between image regions and words in a sentence. Hou et al. [57] model the semantic relevance between class and query feature maps in images through cross-attention to localize more relevant image regions for classification and generate more discriminative features. Finally, Sarlin et al. [58] learn keypoint matching between two indoor images from different viewpoints by leveraging self-attention and cross-attention to boost the receptive field of local descriptors and allow cross-image communication.
Our method introduces attention mechanisms across 3D shapes. Apart from the obvious need to develop cross-attention and non-local convolution operations on the irregular format of point clouds, our method also models the input training collection as a graph whose edges connect instances (shapes) deemed useful for training our cross-attention and non-local convolution operations. The usefulness of shape pairs is determined based on a learned shape compatibility function trained together with the rest of our network.
3 Method
Overview.
Given an input collection of 3D shapes represented as point clouds, the goal of our method is to propagate representations from one shape to another, and perform semantic segmentation. To perform this propagation, we propose cross-shape, non-local convolution operations. These operations update point representations on one shape by performing non-local convolutions with point representations originating from other shapes. This interaction is regulated by a Cross-Shape Attention (CSA) layer that predicts how much pairs of points on different shapes should influence each other. Since performing this exchange of information between all pairs of shapes can become computationally expensive for large collections, we also present a technique to embed shapes of the input collection in a sparse graph, which we call shape collection graph. The nodes of this graph are shapes (Figure 1), and edges specify which pairs of shapes should interact. The edges in this graph also carry weights indicating the compatibility of two shapes for cross-convolution. During training time, this graph is dynamically constructed based on the input training shapes. At test time, the graph is augmented with additional nodes representing test shapes, and edges represent connections between test and training shapes such that the information propagates from the training to the test set.
In the following Section 3.1, we first describe our cross-shape attention layer for a pair of shapes. We then discuss its generalization to multiple shapes along with the pairwise shape compatibility function in Section 3.2. We discuss our network architecture in Section 3.3, and its training along with the construction of the shape collection graph in Section 3.4. Finally, we discuss the augmentation of the collection graph at test time in Section 3.5.
3.1 Cross-shape attention for a pair of shapes
We now introduce our Cross-Shape Attention (CSA) layer for processing a pair of shapes. We assume that both shapes are represented as sets of points and that each point is equipped with a $D$-dimensional representation. Our layer is agnostic to the specifics of the point representation. It can be either raw input (e.g., 3D point positions) or learned representations extracted from existing point processing networks. In our implementation, we used point representations produced by the widely popular DGCNN [1]. Given a shape $S_m$ with point representations stacked in a matrix $\mathbf{X}_m \in \mathbb{R}^{P_m \times D}$, and a shape $S_n$ with point representations similarly stacked in a matrix $\mathbf{X}_n \in \mathbb{R}^{P_n \times D}$ ($P_m$, $P_n$ are the numbers of points in the two shapes), the outputs of our layer are new pointwise representations for both shapes:
$$\mathbf{X}'_m = f(\mathbf{X}_m, \mathbf{X}_n; \boldsymbol{\theta}), \qquad \mathbf{X}'_n = f(\mathbf{X}_n, \mathbf{X}_m; \boldsymbol{\theta}) \tag{1}$$
where $f$ is our cross-shape attention function with learned parameters $\boldsymbol{\theta}$. The function implements the transformations explained in the next paragraphs.
Key and query intermediate representations.
Inspired by recent attention networks [59], we first transform the input point representations of the first shape in the pair to intermediate representations, called “query” representations. The input point representations of the second shape are transformed to intermediate “key” representations. The keys will be compared to queries to determine the degree of influence of one point on another. Specifically, in the case of processing the shape pair $(S_m, S_n)$, these transformations are expressed as follows:
$$\mathbf{q}_m^{(i)} = \mathbf{W}^{(q)} \mathbf{x}_m^{(i)}, \qquad \mathbf{k}_n^{(j)} = \mathbf{W}^{(k)} \mathbf{x}_n^{(j)} \tag{2}$$
where $\mathbf{x}_m^{(i)}$ is the representation of point $i$ on shape $S_m$, $\mathbf{x}_n^{(j)}$ is the representation of point $j$ on shape $S_n$, and $\mathbf{W}^{(q)}$, $\mathbf{W}^{(k)}$ are learned weight matrices. The same matrices are applied to all points of the first shape and points of the second shape respectively to ensure invariance to point permutations. Processing the reverse pair $(S_n, S_m)$ yields different key and query representations, i.e., the layer is not invariant to the order of the two shapes in the pair. As a result, the influences of one point on another will be asymmetric, as explained below.
Pairwise point attention.
The similarity of key and query representations is determined through scaled dot product [59]. This provides a measure of how much one shape point influences the point on the other shape, or in other words, how much cross-shape convolution should attend to this pair. By concatenating all keys and queries into matrices $\mathbf{K}_n$ and $\mathbf{Q}_m$, we can compute cross-shape attention for the input pair as follows:
$$\mathbf{A}_{m,n} = \mathrm{softmax}\left(\frac{\mathbf{Q}_m \mathbf{K}_n^{\top}}{\sqrt{D}}\right) \tag{3}$$
where $\mathbf{A}_{m,n}$ is a $P_m \times P_n$ matrix and $D$ is the dimensionality of the keys and queries. We note that softmax is applied per row. Processing the reverse pair yields a different attention matrix $\mathbf{A}_{n,m}$, which is not simply the transpose of $\mathbf{A}_{m,n}$. This asymmetry is useful because shapes can generally vary in part structure. Letting the attention mechanism adjust rows and columns of the two matrices without imposing symmetry may better account for structural shape differences. We visualize cross-shape attention matrices for characteristic shape pairs in Figure 2. Since our network is trained for segmentation, the cross-shape attention seems to correlate input points from one shape to points of similar parts from the other shape, especially in the upper layers of our network. However, cross-shape attention might also focus on points between different parts to promote dissimilarities in their representations.
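The query/key transformation and the row-wise softmax over scaled dot products can be illustrated with a minimal pure-Python sketch. The function and variable names (`cross_shape_attention`, `W_q`, `W_k`), the identity weight matrices, and the choice of scaling by the key dimensionality are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_shape_attention(X_m, X_n, W_q, W_k):
    """Attention from the points of shape m (queries) to the points of shape n (keys)."""
    queries = [matvec(W_q, x) for x in X_m]
    keys = [matvec(W_k, x) for x in X_n]
    scale = math.sqrt(len(keys[0]))  # assumed: scale by the key dimensionality
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

# Toy shapes with 2 and 3 points; identity weights stand in for learned ones.
X_m = [[1.0, 0.0], [0.0, 1.0]]
X_n = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
A_mn = cross_shape_attention(X_m, X_n, I2, I2)  # 2 rows, 3 columns
A_nm = cross_shape_attention(X_n, X_m, I2, I2)  # 3 rows, 2 columns
```

Because the softmax is applied per row, each row of `A_mn` sums to one, and swapping the roles of the two shapes gives a matrix of a different shape rather than a transpose of the first.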
Complexity.
The above operation is expensive since it involves $O(P_m P_n)$ computation to update the attention matrix. The computation can be accelerated by using range searches with space partitioning structures (e.g., kd-trees on dot products [60]) built on top of the key representations, and updating the matrix only for key representations nearest to the queries. A simpler approach we experimented with is to maintain a subset of points as keys (uniformly subsampled from the original shape), i.e., the attention matrix becomes $P_m \times P'_n$, where $P'_n$ is the number of subsampled keys (see also supplementary material for its effect). In this manner, the attention matrix becomes sparser.
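The key subsampling strategy can be sketched as follows. This is a hedged illustration: `subsample_keys`, its `seed` argument and the toy sizes are hypothetical, and uniform sampling without replacement is one reasonable reading of "uniformly subsampled".

```python
import random

def subsample_keys(X_n, num_keys, seed=0):
    """Keep a uniformly sampled subset of the points of shape n as keys."""
    rng = random.Random(seed)
    num_keys = min(num_keys, len(X_n))
    kept = sorted(rng.sample(range(len(X_n)), num_keys))
    return kept, [X_n[j] for j in kept]

# Toy shape with 10 points, reduced to 4 keys.
X_n = [[float(j), 0.0, 0.0] for j in range(10)]
kept, keys = subsample_keys(X_n, num_keys=4)

num_queries = 6
full_entries = num_queries * len(X_n)  # attention entries without subsampling
sub_entries = num_queries * len(keys)  # attention entries with subsampled keys
```

The indices in `kept` are returned alongside the keys so that the convolution step can skip the points that were dropped.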
Cross-shape convolution.
We now define the non-local, cross-shape convolution operator that uses the above pairwise attention matrix to update the point representations for shape $S_m$:
$$\mathbf{x}_m^{\prime(i)} = \sum_{j=1}^{P_n} \mathbf{A}_{m,n}[i,j] \, \mathbf{W}^{(v)} \mathbf{x}_n^{(j)} \tag{4}$$
where $\mathbf{W}^{(v)}$ is a learned transformation (same for all points to ensure invariance to permutations), and $\mathbf{A}_{m,n}[i,j]$ accesses the corresponding point attention value (a scalar) for the pair of point $i$ on $S_m$ and point $j$ on $S_n$. To accelerate the computation, we can skip the summation for points not associated with keys due to subsampling.
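The convolution step amounts to an attention-weighted sum of transformed point representations from the other shape. The sketch below assumes the learned transformation is a single linear map (named `W_v` here, a hypothetical name) and uses a hand-written attention matrix instead of a learned one.

```python
def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cross_shape_conv(A, X_n, W_v):
    """Update each point of shape m as an attention-weighted sum over points of shape n."""
    values = [matvec(W_v, x) for x in X_n]  # same transformation for every point
    dim = len(values[0])
    return [
        [sum(A[i][j] * values[j][d] for j in range(len(values))) for d in range(dim)]
        for i in range(len(A))
    ]

# Toy example: 2 query points attending to 2 points of another shape.
A = [[1.0, 0.0],   # point 0 attends only to the first point of the other shape
     [0.5, 0.5]]   # point 1 attends equally to both points
X_n = [[2.0, 0.0], [0.0, 4.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
X_m_new = cross_shape_conv(A, X_n, I2)
```

Passing the same shape as both the query and the key side turns this into the self-shape case.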
Self-shape attention.
The pairwise point attention of Eq. 3 and non-local convolution operator of Eq. 4 can also be applied to a pair that consists of the shape and itself. In this case, our CSA layer implements a form of Self-Shape Attention (SSA), enabling long-range interactions between shape points modulated by our pairwise point attention mechanism.
3.2 Cross-shape attention for multiple shapes
We now generalize the non-local convolution of Eq. 4 to handle updates from multiple shapes, and also combine cross-shape attention with self-shape attention. Here we assume that the input to our CSA layer is a shape $S_m$ from an input collection, and a set $\mathcal{C}(m)$ of other shapes deemed compatible for cross-shape convolution with this shape. We discuss how this set is selected in Section 3.4 during training, and Section 3.5 during testing. Given this set of shapes as input, our CSA layer outputs point representations for shape $S_m$ as follows:
$$\mathbf{x}_m^{\prime(i)} = \sum_{n \in \mathcal{C}(m) \cup \{m\}} c(m,n) \sum_{j=1}^{P_n} \mathbf{A}_{m,n}[i,j] \, \mathbf{W}^{(v)} \mathbf{x}_n^{(j)} \tag{5}$$
where $c(m,n)$ is a learned pairwise function that outputs a single scalar representing the compatibility between shapes $S_m$ and $S_n$. The key idea of the above operation is to update point representations of shape $S_m$ as a weighted average of attention-modulated representations computed by using other shapes as well as the shape itself. The compatibility function determines the weights that different shapes should have for cross-shape convolution. It also implicitly provides the weight of self-shape attention when $n = m$.
Compatibility function.
To compute the compatibility function, we first extract a global descriptor for the shape $S_m$ and for each other shape in the input compatible set $\mathcal{C}(m)$. These descriptors are extracted by a DGCNN network [1] dedicated to extracting descriptors for shape compatibility. Specifically, it performs max and average pooling on individual point representations concatenated from all DGCNN point processing layers. Given point representations $\tilde{\mathbf{x}}_m^{(i)}$ for shape $S_m$, its global descriptor is computed as $\mathbf{y}_m = [\max_i \tilde{\mathbf{x}}_m^{(i)}; \ \mathrm{avg}_i \, \tilde{\mathbf{x}}_m^{(i)}]$, and similarly for the other shape descriptors. We then compare these global descriptors through scaled dot product attention [59]:
$$s(m,n) = \frac{(\mathbf{U}^{(q)} \mathbf{y}_m)^{\top} (\mathbf{U}^{(k)} \mathbf{y}_n)}{\sqrt{D'}} \tag{6}$$
where $\mathbf{U}^{(q)}$ and $\mathbf{U}^{(k)}$ are learned transformations, and $D'$ is the dimensionality of the global descriptors. The resulting comparison of descriptors provides us with a measure of compatibility between two shapes (or a shape with itself). We normalize the above measures through softmax so that the sum of compatibilities of the shape $S_m$ with all other shapes in the set $\mathcal{C}(m)$, including the self-compatibility (i.e., the weight of self-shape attention), is one:
$$c(m,n) = \frac{\exp(s(m,n))}{\sum_{n' \in \mathcal{C}(m) \cup \{m\}} \exp(s(m,n'))} \tag{7}$$
As discussed in Section 3.4, our method learns the pairwise shape compatibilities to maximize the segmentation performance during training.
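The compatibility weighting can be sketched end to end: pool point features into a global descriptor, score descriptor pairs with a scaled dot product, normalize with softmax, and use the weights to blend per-shape outputs. All names (`global_descriptor`, `compatibility_weights`, `combine`, `U_q`, `U_k`) are hypothetical, the identity matrices stand in for learned transformations, and the toy features replace DGCNN outputs.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_descriptor(X):
    """Concatenate max-pooled and average-pooled point features."""
    dims = range(len(X[0]))
    max_pool = [max(x[d] for x in X) for d in dims]
    avg_pool = [sum(x[d] for x in X) / len(X) for d in dims]
    return max_pool + avg_pool

def compatibility_weights(y_m, candidates, U_q, U_k):
    """Softmax-normalized scaled dot products between y_m and each candidate descriptor."""
    query = matvec(U_q, y_m)
    scale = math.sqrt(len(y_m))  # assumed: scale by the descriptor dimensionality
    scores = [sum(a * b for a, b in zip(query, matvec(U_k, y))) / scale
              for y in candidates]
    return softmax(scores)

def combine(weights, per_shape_outputs):
    """Weighted sum of per-shape convolution outputs for the same query shape."""
    num_points, dim = len(per_shape_outputs[0]), len(per_shape_outputs[0][0])
    return [[sum(w * out[i][d] for w, out in zip(weights, per_shape_outputs))
             for d in range(dim)] for i in range(num_points)]

shape_m = [[1.0, 0.0], [1.0, 0.0]]
similar = [[1.0, 0.0], [0.9, 0.0]]
different = [[0.0, 1.0], [0.0, 1.0]]
I4 = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

# The candidate list starts with the shape itself (self-compatibility).
descs = [global_descriptor(s) for s in (shape_m, similar, different)]
weights = compatibility_weights(descs[0], descs, I4, I4)
blended = combine([0.5, 0.5], [[[2.0]], [[4.0]]])
```

In this toy setting the shape itself receives the largest weight, the similar shape comes next, and the dissimilar shape receives the smallest weight.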
3.3 Architecture
We now discuss how we combined the CSA layers in our network architecture. We visualize our implemented architecture, called CrossShapeNet, in the case of processing a shape along with another compatible shape in Figure 3. As discussed earlier, we use DGCNN layers as our backbone. Specifically, given a shape with 3D point positions as input, DGCNN uses a sequence of graph convolution layers, called EdgeConv layers, to output per-point representations for each EdgeConv layer (and similarly for the other shape). We attach a CSA layer processing the outputs of each corresponding EdgeConv layer. Each CSA layer has its own learned weight matrices. It outputs new pointwise representations stacked in a matrix based on Equation 5.
Then all point representations from all CSA and corresponding DGCNN layers are concatenated for each point, forming the final pointwise representations. We observed that better performance can be achieved by applying CSA on all DGCNN layers of representations instead of only the last one. The point representations are mapped to part label probabilities through a three-layer MLP and softmax. We provide more detail about the exact configuration and dimension of layers in our supplementary material.
3.4 Training
We now discuss our procedure for training CrossShapeNet. The input to our training procedure is a training collection of labeled point clouds with part annotations, along with a smaller annotated collection used for hold-out validation. To train our CSA layers, we first need to form the compatible sets used in Equation 5 for each shape. The compatible sets are created by defining the one-ring neighbors of each shape in the shape collection graph (Figure 1, left). We maintain one such graph for the training collection, and another graph for the hold-out validation one. Below we discuss the initialization of the collection graphs, then discuss training of CrossShapeNet and updates to the graphs.
Shape collection graph initialization.
We first connect each shape to its $K$ nearest neighbors computed through global descriptors extracted from a DGCNN [1] pretrained for classification on ModelNet40. These neighbors form an initial estimate of our “compatible” shape set for each training shape $S_m$. $K$ is an input parameter to our method. We discuss its effect for different values in our results section. Given this initial collection graph, each training shape can be processed through CrossShapeNet. We train it using cross-entropy:
$$L = -\sum_{m} \sum_{i=1}^{P_m} \log \mathbf{p}_m^{(i)}[\, l_m^{(i)} \,]$$
where $l_m^{(i)}$ denotes the ground-truth label for point $i$ of training shape $S_m$, and $\mathbf{p}_m^{(i)}$ is the output probability distribution over part labels for that point. During training, the compatibility function also receives a supervisory signal from the above part labeling loss. As a result, the pairwise shape compatibilities, self-compatibilities, and global shape descriptors used for their computation are updated during training.
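For concreteness, the per-point cross-entropy over part labels can be written as a short function. This is a sketch under assumptions: the function name `cross_entropy` is hypothetical, and the loss is summed over points here, whereas an implementation may equally average it.

```python
import math

def cross_entropy(probs, labels):
    """Sum of negative log-probabilities assigned to the ground-truth part labels."""
    return -sum(math.log(p[label]) for p, label in zip(probs, labels))

# Two points, two part labels: predicted distributions and ground-truth labels.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = cross_entropy(probs, labels)
expected = math.log(2.0) + math.log(4.0 / 3.0)
```

Because the compatibility weights enter the predicted distributions, gradients of this loss also reach the compatibility function and the global descriptors.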
We also create a collection graph in our hold-out validation set using the same procedure and monitor the hold-out validation loss. We train CrossShapeNet until the validation loss saturates, then we update the shape collection graph.
Shape collection graph update.
Based on the updated global shape descriptors, new $K$ nearest neighbors for each shape are picked based on the compatibility measure of Equation 6. These new neighbors result in updating the collection graphs for training and hold-out validation, and in turn form new “compatible” sets of shapes for each training shape. After this update, we rerun training with the above loss for more epochs.
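The graph update can be sketched as a top-K selection over pairwise compatibility scores. The function name `update_collection_graph` and the toy score matrix are hypothetical; the scores would come from the learned compatibility measure, which need not be symmetric.

```python
def update_collection_graph(scores, K):
    """For every shape, keep the K other shapes with the highest compatibility score."""
    graph = {}
    for m, row in enumerate(scores):
        candidates = [n for n in range(len(row)) if n != m]  # exclude the shape itself
        candidates.sort(key=lambda n: row[n], reverse=True)
        graph[m] = candidates[:K]
    return graph

# Toy compatibility scores between 4 shapes (row m scores shape m against all others).
scores = [
    [9.0, 0.8, 0.1, 0.5],
    [0.7, 9.0, 0.2, 0.9],
    [0.3, 0.4, 9.0, 0.6],
    [0.2, 0.9, 0.5, 9.0],
]
graph = update_collection_graph(scores, K=2)
```

The resulting graph is directed: shape 0 selects shape 1 as a neighbor, while shape 1 prefers shapes 3 and 0, in that order.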
We alternate between updating the shape collection graph and training CrossShapeNet. The hold-out validation collection graph is also updated and used to monitor training, which stops when the validation performance saturates.
Implementation details.
We use the Adam optimizer with a learning rate of 0.001. The batch size depends on the number of neighbors $K$: we use a different batch size for CrossShapeNet with $K = 1$, $3$, and $5$. We provide more details on our architecture in the supplementary material. We note that our implementation will become publicly available after the review process.
3.5 Test time
At test time, for each test shape, we find the $K$ nearest training shapes to define its neighborhood in the shape collection graph based on our compatibility measure (Equation 6). Since the measure involves dot products between descriptors, the computation can run reasonably fast in our implementation (it can also be accelerated with kd-trees [60]). The trained CrossShapeNet is then used to extract the part annotations for the test shapes.
4 Results
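Table 1: Part IoU per PartNet category for the DGCNN backbone and CrossShapeNet variants.

| Method | bed | bott | chair | clock | dish | disp | door | ear | fauc | knife | lamp | micro | frid | stor | table | trash | vase | avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGCNN | 33.1 | 47.5 | 36.4 | 22.2 | 45.7 | 78.7 | 23.1 | 40.6 | 52.3 | 28.3 | 20.0 | 34.9 | 28.5 | 41.2 | 27.0 | 43.1 | 50.7 | 38.4 |
| CrossShapeNetSSA | 29.5 | 48.8 | 34.7 | 24.8 | 44.2 | 78.2 | 33.0 | 44.8 | 54.0 | 31.8 | 20.4 | 42.7 | 35.5 | 41.6 | 25.1 | 40.5 | 49.9 | 40.0 |
| CrossShapeNetK1 | 34.7 | 44.4 | 34.5 | 23.2 | 40.6 | 78.3 | 41.8 | 46.6 | 51.5 | 37.8 | 19.4 | 41.5 | 34.1 | 43.1 | 29.2 | 45.8 | 51.0 | 41.0 |
| CrossShapeNetK3 | 35.6 | 45.7 | 35.4 | 28.8 | 31.3 | 79.0 | 35.0 | 46.2 | 53.9 | 33.7 | 20.3 | 49.7 | 39.7 | 43.9 | 27.8 | 42.6 | 52.2 | 41.2 |
| CrossShapeNetK5 | 36.9 | 41.1 | 34.2 | 31.3 | 43.8 | 77.6 | 24.2 | 46.1 | 54.0 | 38.0 | 19.7 | 48.8 | 41.5 | 41.8 | 27.7 | 46.0 | 52.7 | 41.5 |
| N shapes | 212 | 464 | 6400 | 579 | 201 | 954 | 245 | 247 | 708 | 384 | 2271 | 212 | 207 | 2303 | 8309 | 340 | 1104 | |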
We evaluated our method for fine-grained shape segmentation both qualitatively and quantitatively. Below we discuss the dataset, evaluation metrics, and comparisons with our backbone (DGCNN [1]) and other state-of-the-art models.
Dataset.
We use the recent PartNet dataset [10] for training and evaluating our method according to its provided training, validation, and testing splits. Similarly to the evaluation done in [3] and [15], our evaluation focuses on the fine-grained level of semantic segmentation, which includes 17 out of the 24 object categories present in the PartNet dataset. To train our method, we used 2.5K points randomly subsampled from the original 10K points provided for each training shape in PartNet. This resolution is similar to the one used in the original DGCNN for shape classification and segmentation [1]. We performed this subsampling to enable faster training. Our network, like DGCNN, can handle different numbers of points as input. Thus, at test time, we process all 10K points for test shapes. Compared to testing on 2.5K points and then upsampling the network outputs to 10K points through nearest-neighbor interpolation, our strategy of processing the higher-resolution point clouds directly worked better (see supplementary material for more details). For evaluation, we use the standard performance metrics of part and shape mIoU. Like other methods ([3] and [15]), we emphasize part IoU in our evaluation, since it better reflects the labeling accuracy of fine-grained parts in each shape.
Comparison with our backbone.
Our cross-shape convolution layers are built on top of the EdgeConv layers of DGCNN [1]. Thus, the primary goal in our evaluation is to examine how much cross-shape convolution improves our backbone. We train our backbone on the same training splits under the same point cloud resolution, using only points as input (no normals), and testing on 10K points as in our network. Training and testing of DGCNN and our network is done for each shape category separately.
Table 1 reports part IoU for the DGCNN backbone, and the following variants of CrossShapeNet: (a) CrossShapeNetSSA uses CSA layers that compute attention and perform convolution of each shape with itself only, i.e., using only self-shape attention based on Equation 4; (b) CrossShapeNetK1 is a network that uses the general form of CSA layers (Equation 5), combining self-shape attention and cross-shape attention of each shape with one other shape ($K=1$); (c) CrossShapeNetK3 is the same as above but using three other shapes in cross-shape convolution; (d) CrossShapeNetK5 is the same as above but using five other shapes.
Based on the results, we make the following observations. First, the performance increases using self-shape attention in most classes relative to our backbone (average part mIoU increases from 38.4 to 40.0). Then performance is further increased in most classes using cross-shape convolution with $K=1$ (average part mIoU reaches 41.0). The best performance is achieved using $K=5$ in most classes (average part mIoU reaches 41.5). In terms of shape IoU, our best cross-shape convolution model CrossShapeNetK5 also increases it (see supplementary material for more details).
Our main observation is that the improvements are more common in categories with relatively fewer training shapes:

For the smaller categories, i.e., all except Chair, Lamp, Storage Furniture, Table and Vase (each of which has more than 1,000 shapes in Table 1), the average part IoU increases from 39.8 in DGCNN to 44.1 in CrossShapeNetK5, representing an increase of 4.3.

For the smallest categories, i.e., Bed, Dishwasher, Door, Earphone, Microwave and Refrigerator (each of which has fewer than 250 shapes in Table 1), the average part mIoU rises from 34.3 in DGCNN to 40.2 in CrossShapeNetK5, representing an increase of 5.9.
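The per-group averages for small categories can be recomputed directly from the per-category values in Table 1. In the sketch below, the dictionaries copy the DGCNN and CrossShapeNetK5 rows and the shape counts from that table; the thresholds of 1000 and 250 shapes are assumptions chosen because they reproduce exactly the category groups listed above.

```python
# Shape counts and part IoU values copied from Table 1.
counts = {"bed": 212, "bott": 464, "chair": 6400, "clock": 579, "dish": 201,
          "disp": 954, "door": 245, "ear": 247, "fauc": 708, "knife": 384,
          "lamp": 2271, "micro": 212, "frid": 207, "stor": 2303, "table": 8309,
          "trash": 340, "vase": 1104}
dgcnn = {"bed": 33.1, "bott": 47.5, "chair": 36.4, "clock": 22.2, "dish": 45.7,
         "disp": 78.7, "door": 23.1, "ear": 40.6, "fauc": 52.3, "knife": 28.3,
         "lamp": 20.0, "micro": 34.9, "frid": 28.5, "stor": 41.2, "table": 27.0,
         "trash": 43.1, "vase": 50.7}
k5 = {"bed": 36.9, "bott": 41.1, "chair": 34.2, "clock": 31.3, "dish": 43.8,
      "disp": 77.6, "door": 24.2, "ear": 46.1, "fauc": 54.0, "knife": 38.0,
      "lamp": 19.7, "micro": 48.8, "frid": 41.5, "stor": 41.8, "table": 27.7,
      "trash": 46.0, "vase": 52.7}

def mean_iou(row, categories):
    """Average part IoU of one method over a list of categories."""
    return sum(row[c] for c in categories) / len(categories)

small = [c for c in counts if counts[c] < 1000]  # assumed threshold
tiny = [c for c in counts if counts[c] < 250]    # assumed threshold

summary = {
    "all": (round(mean_iou(dgcnn, list(counts)), 1), round(mean_iou(k5, list(counts)), 1)),
    "small": (round(mean_iou(dgcnn, small), 1), round(mean_iou(k5, small), 1)),
    "tiny": (round(mean_iou(dgcnn, tiny), 1), round(mean_iou(k5, tiny), 1)),
}
```

The gap between the two methods widens as the groups shrink: about 3.1 points over all categories, 4.3 over the twelve smaller ones, and 5.9 over the six smallest.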
Figure 4 shows characteristic examples of point cloud segmentations based on our DGCNN backbone and CrossShapeNetK5 along with ground-truth segmentations. We observe that our network is able to provide a finer level of segmentation, especially for smaller parts or parts with complex geometry.
Comparisons with state-of-the-art.
We now discuss comparisons with two other methods that recently demonstrated state-of-the-art performance in PartNet. ResGCN [3] showed the benefit of residual/dense connections and dilated convolutions in graph neural networks, making them also very deep. DeepConvPN [15] introduces multi-resolution, residual, and cross-link blocks to process multi-scale and multi-resolution information to improve the original PointNet++ [12] and DGCNN [1] modules, also doubling the number of their layers. Our work instead focused on the concept of cross-shape attention, which can be considered orthogonal to the improvements of the two other models.
Table 2 shows part IoU for each of the PartNet categories for which the other two methods report their performance. For these comparisons, we report the performance of “CrossShapeNetK5” along with another variant, called “CrossShapeNetKVal”. For this variant, we train three cross-shape convolution models for $K = 1, 3, 5$ in each shape category, and select the model whose $K$ yielded the best hold-out validation performance (part IoU). In the last column, we report the part IoU averaged over the 17 categories.
Our “CrossShapeNetK5” model compares favorably to the above two methods, resulting in an average part IoU of 41.5, which is 0.7 lower than DeepConvPN (42.2) and 1.0 lower than the much deeper ResGCN28 (28 layers, 42.5). Selecting the best performing value of $K$ for cross-shape convolution (“CrossShapeNetKVal”, 42.9) has a small edge of 0.4 compared to the state-of-the-art. We emphasize that these comparisons are not necessarily fair since ResGCN and DeepConvPN are deeper, while CrossShapeNetKVal is based on an ensemble. The comparisons can be used as reference to evaluate the orthogonal improvements of all these models.
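Table 2: Part IoU per PartNet category for CrossShapeNet variants, ResGCN28 [3] and DeepConvPN [15].

| Method | bed | bott | chair | clock | dish | disp | door | ear | fauc | knife | lamp | micro | frid | stor | table | trash | vase | avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CrossShapeNetK5 | 36.9 | 41.1 | 34.2 | 31.3 | 43.8 | 77.6 | 24.2 | 46.1 | 54.0 | 38.0 | 19.7 | 48.8 | 41.5 | 41.8 | 27.7 | 46.0 | 52.7 | 41.5 |
| CrossShapeNetKVal | 36.9 | 44.4 | 35.3 | 31.3 | 43.8 | 79.0 | 41.8 | 46.2 | 51.5 | 38.0 | 20.3 | 49.7 | 39.7 | 43.9 | 29.2 | 45.8 | 52.7 | 42.9 |
| ResGCN28 [3] | 35.2 | 36.8 | 33.8 | 32.6 | 52.7 | 84.4 | 42.5 | 41.9 | 49.7 | 35.4 | 20.0 | 54.3 | 46.1 | 42.5 | 14.8 | 49.7 | 50.8 | 42.5 |
| DeepConvPN [15] | 29.5 | 42.1 | 41.8 | 34.7 | 33.2 | 81.6 | 34.8 | 49.6 | 53.0 | 44.8 | 28.4 | 33.5 | 32.3 | 41.1 | 36.3 | 43.1 | 57.8 | 42.2 |
| N shapes | 212 | 464 | 6400 | 579 | 201 | 954 | 245 | 247 | 708 | 384 | 2271 | 212 | 207 | 2303 | 8309 | 340 | 1104 | |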
5 Conclusion
We presented a new type of graph neural network for 3D shape processing that enables point interactions and information exchange between shapes in an input collection. Our experiments demonstrated significant improvements of using our cross-shape convolution and attention layers over conventional graph convolution approaches, such as DGCNN [1], especially in the regime of a limited number of training examples.
There are several avenues for future work. First, computing self- and cross-attention is computationally intensive for large or even moderately-sized point clouds. In our preliminary experiments, we investigated a subsampling approach to make the cross-shape attention matrix sparser, yet we believe that using hierarchical models and spatial subdivision structures would be better alternatives. Second, it would be interesting to incorporate deeper and wider backbones in our method, such as the ones proposed in [3] and [15]. In addition, we did not fully exploit the concept of the shape collection graph. It is possible to enable interactions also between test shapes. Finally, we suspect that our method will be useful for other tasks in point cloud processing, such as classification and correspondences, especially in few-shot learning regimes.
Appendix A Supplementary experiments
In this appendix, we provide additional experiments and comparisons regarding: (a) subsampling vs keeping the original keys for CSA, (b) upsampling point labels versus testing directly on higher resolution point clouds. We also report evaluation based on the shape mIoU metric. Finally, we provide additional architecture and training details, as well as plots demonstrating the utility of updating the shape collection graph during training.
Subsampling keys.
As explained in Section 3.1, to accelerate the cross-convolution operator, the number of keys can be reduced to sparsify the cross-shape attention matrix. We conducted an experiment to evaluate the effect of this subsampling. Specifically, we reduced the keys down to 1K points by random subsampling. This led to a faster forward pass through all four CSA layers; timings were measured for a feed-forward pass of 2 shapes in a CSA layer on an NVIDIA RTX 2070 used for benchmarking. Results are presented in Table 3 in terms of part IoU (“CrossShapeNetSubK3” means CSA with subsampling). They suggest that downsampling keys during training is computationally beneficial and does not lead to a performance drop (average part IoU of 41.4 with subsampling versus 41.2 without). As discussed in our main paper, investigating other strategies, such as a hierarchical approach or using kd-trees in high-dimensional spaces, could prove more beneficial.
Method  bed  bott  chair  clock  dish  disp  door  ear  fauc  knife  lamp  micro  frid  stor  table  trash  vase  avg
CrossShapeNetSubK3  37.9  44.9  34.4  32.4  36.5  78.8  37.0  43.6  51.8  35.1  19.3  45.9  38.8  43.4  25.7  45.6  52.3  41.4
CrossShapeNetK3  35.6  45.7  35.4  28.8  31.3  79.0  35.0  46.2  53.9  33.7  20.3  49.7  39.7  43.9  27.8  42.6  52.2  41.2
N shapes  212  464  6400  579  201  954  245  247  708  384  2271  212  207  2303  8309  340  1104
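The key subsampling evaluated in this experiment can be illustrated with a minimal sketch. It assumes a plain scaled dot-product attention between query and key point features, which may differ from the exact CSA formulation of Section 3.1; the helper names `subsample_keys` and `cross_attention`, the feature dimension, and the point counts are hypothetical and chosen only for illustration.

```python
import math
import random


def subsample_keys(keys, num_keep, rng):
    """Keep a uniformly random subset of key points (without replacement)."""
    if num_keep >= len(keys):
        return list(keys)
    return [keys[i] for i in rng.sample(range(len(keys)), num_keep)]


def cross_attention(queries, keys):
    """Softmax-normalized scaled dot-product attention of each query over the keys.

    Returns one row of weights per query; each row sums to 1.
    (Assumed attention form, used here only to show the effect on matrix size.)
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    rows = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


rng = random.Random(0)
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]

kept = subsample_keys(keys, 20, rng)
attn = cross_attention(queries, kept)  # 5 x 20 weights instead of 5 x 50
```

The cost of the attention matrix is proportional to the number of queries times the number of keys, so dropping keys shrinks it linearly while every query point still receives a full attention distribution.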
Upsampling point labels versus testing on higherresolution directly.
As discussed in our results section, we trained all variants of our architecture and our DGCNN backbone on 2.5K points, then tested on the 10K points provided in the PartNet benchmark.
One possibility for dealing with this difference in resolution is to simply make predictions on 2.5K points (i.e., at lower resolution), then perform nearest-neighbor upsampling to transfer the labels to the 10K points (i.e., each point in the higher-resolution shape copies the label of the nearest point in the low-resolution representation).
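As a minimal sketch of this label transfer, the following brute-force version copies to each high-resolution point the label of its nearest low-resolution point. The function name `upsample_labels` and the toy coordinates are illustrative assumptions; for 10K points a spatial index such as a k-d tree would be the more practical choice.

```python
def upsample_labels(low_res_points, low_res_labels, high_res_points):
    """Copy to each high-resolution point the label of its nearest low-resolution point."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    labels = []
    for p in high_res_points:
        # Brute-force nearest-neighbor search over the low-resolution points.
        nearest = min(range(len(low_res_points)),
                      key=lambda i: sq_dist(p, low_res_points[i]))
        labels.append(low_res_labels[nearest])
    return labels


low = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
low_labels = [0, 1]
high = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (0.4, 0.0, 0.0), (0.6, 0.0, 0.0)]
print(upsample_labels(low, low_labels, high))  # [0, 1, 0, 1]
```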
Another possibility is to pass the original, higher-resolution test points directly to our architecture (and backbone). However, processing higher-resolution point clouds directly requires adjusting the receptive field of our backbone (DGCNN). During training, our backbone uses nearest neighbors in its EdgeConv layers. To achieve a similar receptive field in the higher-resolution point cloud, we increase the number of neighbors at test time, since using the original number of neighbors would result in a smaller receptive field in the higher-resolution point cloud.
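The neighbor counts used for this adjustment are not reproduced here. One simple heuristic, stated purely as an assumption, is to scale the number of neighbors in proportion to the number of points: for roughly uniform surface sampling, the number of samples inside a neighborhood of fixed radius grows linearly with the sample count. The function name `scaled_num_neighbors` and the value `k_train=10` are hypothetical.

```python
def scaled_num_neighbors(k_train, n_train, n_test):
    """Scale k with the point count so a kNN neighborhood covers a similar surface area.

    Assumes roughly uniform sampling; this is a heuristic, not the paper's stated rule.
    """
    return max(1, round(k_train * n_test / n_train))


# Hypothetical training value k_train=10; 2.5K training points, 10K test points.
print(scaled_num_neighbors(k_train=10, n_train=2500, n_test=10000))  # 40
```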
Table 4 shows a comparison of these two strategies: nearest-neighbor upsampling (“CrossShapeNetK3NN”) versus testing directly on the higher-resolution point cloud with receptive field adjustment (“CrossShapeNetK3”). Since nearest-neighbor upsampling depends on the initial choice of the 2.5K subsampled subset and therefore has inherent randomness, we report results averaged over test runs (where we randomly choose the subsampled subset each time). Testing directly on the higher-resolution point cloud with the above receptive field adjustment yields better results: its average part mIoU in Table 4 is 41.2, versus 39.9 with nearest-neighbor upsampling.
We observed the same trend when testing our backbone alone, i.e., DGCNN with nearest-neighbor upsampling versus DGCNN processing the higher-resolution point cloud at test time with the same receptive field adjustment.
Method  bed  bott  chair  clock  dish  disp  door  ear  fauc  knife  lamp  micro  frid  stor  table  trash  vase  avg
CrossShapeNetK3NN  32.7  43.2  36.5  28.0  30.4  78.4  34.7  43.9  52.3  32.2  20.6  44.7  38.2  39.6  29.4  42.5  51.0  39.9
CrossShapeNetK3  35.6  45.7  35.4  28.8  31.3  79.0  35.0  46.2  53.9  33.7  20.3  49.7  39.7  43.9  27.8  42.6  52.2  41.2
N shapes  212  464  6400  579  201  954  245  247  708  384  2271  212  207  2303  8309  340  1104
Appendix B Evaluation wrt Shape mIoU
As discussed in our results section, we emphasized part IoU in our evaluation, similarly to prior work, since it better reflects the labeling accuracy of fine-grained parts in each shape.
For completeness, Table 5 reports the alternative shape mIoU metric. The results show that our method significantly improves the shape mIoU of our baseline model on average (from 44.4 for DGCNN to between 49.9 and 50.9 for our variants), and in the majority of classes.
Method  bed  bott  chair  clock  dish  disp  door  ear  fauc  knife  lamp  micro  frid  stor  table  trash  vase  avg
DGCNN  22.2  57.3  38.2  35.4  46.4  77.9  36.5  49.6  54.3  28.4  26.7  44.6  42.3  42.2  34.6  44.5  74.4  44.4
CrossShapeNetK1  28.6  56.7  43.4  41.5  46.2  79.9  39.0  56.2  58.6  36.8  35.8  52.2  52.2  47.7  42.8  54.5  76.6  49.9
CrossShapeNetK3  28.3  58.8  44.5  42.0  46.2  79.9  45.8  56.1  59.2  29.1  35.0  51.8  55.1  49.4  41.1  52.5  77.0  50.1
CrossShapeNetK5  27.6  60.4  43.1  40.2  45.7  78.7  38.5  55.3  58.1  39.9  32.3  53.2  58.5  50.1  38.6  54.1  77.6  50.1
CrossShapeNetBestVal  28.6  60.4  44.5  40.2  46.2  79.9  45.8  56.1  58.6  39.9  35.8  53.2  52.2  50.1  42.8  54.5  76.6  50.9
N shapes  212  464  6400  579  201  954  245  247  708  384  2271  212  207  2303  8309  340  1104
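To make the difference between the two metrics concrete, the sketch below computes both under one common convention, which is an assumption here: the benchmark's official protocol may differ in details such as how parts absent from a shape are handled. Shape mIoU averages the per-shape mean IoU over shapes, whereas part mIoU first averages each part's IoU over the shapes where it appears and then averages over parts. All function names are hypothetical.

```python
def part_ious(pred, gt, num_parts):
    """IoU of each part label on one shape; None when the part is absent from both."""
    ious = []
    for part in range(num_parts):
        inter = sum(1 for p, g in zip(pred, gt) if p == part and g == part)
        union = sum(1 for p, g in zip(pred, gt) if p == part or g == part)
        ious.append(inter / union if union else None)
    return ious


def shape_miou(preds, gts, num_parts):
    """Average over shapes of the mean IoU of the parts present on each shape."""
    per_shape = []
    for pred, gt in zip(preds, gts):
        present = [v for v in part_ious(pred, gt, num_parts) if v is not None]
        per_shape.append(sum(present) / len(present))
    return sum(per_shape) / len(per_shape)


def part_miou(preds, gts, num_parts):
    """Average over parts of each part's mean IoU across the shapes where it appears."""
    per_part = [[] for _ in range(num_parts)]
    for pred, gt in zip(preds, gts):
        for part, v in enumerate(part_ious(pred, gt, num_parts)):
            if v is not None:
                per_part[part].append(v)
    means = [sum(v) / len(v) for v in per_part if v]
    return sum(means) / len(means)


# Two toy shapes with three possible part labels; part 2 occurs only on shape B.
gts = [[0, 0, 0, 0, 1, 1], [0, 0, 2, 2]]
preds = [[0, 0, 0, 0, 1, 1], [0, 0, 0, 2]]
print(round(shape_miou(preds, gts, 3), 4))  # 0.7917
print(round(part_miou(preds, gts, 3), 4))   # 0.7778
```

In this toy case the poorly segmented rare part (label 2) lowers the part-based score more than the shape-based one, which is the sense in which a part-level metric is more sensitive to fine-grained parts.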
Appendix C Implementation and training details.
Table 6 presents the layers of our “CrossShapeNetK1” variant in detail (the “CrossShapeNetK3” and “CrossShapeNetK5” variants follow the same structure). The network takes a pair of query and key shapes as inputs. For the query shape, it outputs the probability of each part label per point. Layers 2-5 perform EdgeConv on the query and key shapes separately; layer 6 creates the DGCNN representation of the shapes; layers 7-10 compute CrossShape Attention (CSA) for each EdgeConv layer; layers 12-15 compute SelfShape Attention (SSA); layer 17 computes the similarity between query and key points (see Table 7); layer 18 computes the similarity between query and query points (Table 7); layer 19 takes a softmax over the outputs of layers 17-18; layer 20 takes the weighted sum of cross-shape attention and self-shape attention based on these values; layer 21 concatenates the DGCNN and Cross/SelfShape Attention representations; layers 22-27 apply an MLP to the constructed representations to produce part label probabilities. For CrossShapeNet we use group normalization [61] in the EdgeConv layers, since our batch sizes are very small (6 for K1, 3 for K3, 2 for K5).
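The blending step described for layers 17 to 20 can be sketched as follows: the two scalar similarities are passed through a softmax, and the resulting weights combine the cross-shape and self-shape features point by point. The function name `blend_attention` and the toy feature values are assumptions for illustration; the features stand in for the concatenated CSA and SSA outputs.

```python
import math


def blend_attention(csa_feats, ssa_feats, sim_cross, sim_self):
    """Blend cross- and self-shape features with softmax weights from two scalar similarities."""
    top = max(sim_cross, sim_self)  # subtract the max for numerical stability
    e_cross = math.exp(sim_cross - top)
    e_self = math.exp(sim_self - top)
    w_cross = e_cross / (e_cross + e_self)
    w_self = 1.0 - w_cross
    blended = [[w_cross * c + w_self * s for c, s in zip(c_row, s_row)]
               for c_row, s_row in zip(csa_feats, ssa_feats)]
    return blended, (w_cross, w_self)


csa = [[1.0, 0.0], [0.0, 2.0]]  # toy per-point cross-shape features (N x C)
ssa = [[3.0, 4.0], [2.0, 0.0]]  # toy per-point self-shape features (N x C)
blended, weights = blend_attention(csa, ssa, sim_cross=0.7, sim_self=0.7)
print(weights)  # (0.5, 0.5)
print(blended)  # [[2.0, 2.0], [1.0, 1.0]]
```

When the key shape is much less compatible with the query than the query is with itself, the softmax pushes the weight toward the self-shape features, so an unhelpful neighbor contributes little.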
Table 7 shows the architecture of the submodule used to determine the compatibility between two shapes (see Section 3.2, “compatibility function” paragraph). We call this submodule “ShapeCompatibilityNet”. The network takes the query and key shapes as inputs and outputs the compatibility between them, which is used to weigh the Cross and SelfShape Attention. Layers 2-5 perform EdgeConv on the query and key shapes separately; layer 6 creates the DGCNN representation of the shapes; layer 7 performs a linear transformation of the previous layer; layers 8-10 perform max and average pooling over points and concatenate the results to compute global shape descriptors; layers 11-12 perform the query/key transformations of the global descriptors; layer 13 computes the similarity as a scaled dot product between the query and key descriptors.
We initialize the weights of the ShapeCompatibilityNet from a model pretrained on the ModelNet40 dataset [20]. We do not replace batch normalization with group normalization here, since we use a pretrained network. For additional information on our DGCNN backbone, we refer the reader to [1].
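A minimal sketch of the pooling and similarity steps of this submodule is given below, assuming per-point features have already been computed by the backbone. The helper names, the tiny feature sizes, and the identity weight matrices are hypothetical; in the network the query and key transformations are learned, and scaling by the square root of the descriptor dimension is the usual convention assumed here.

```python
import math


def global_descriptor(point_feats):
    """Concatenate per-channel max pooling and average pooling over points."""
    channels = list(zip(*point_feats))
    max_pool = [max(c) for c in channels]
    avg_pool = [sum(c) / len(c) for c in channels]
    return max_pool + avg_pool


def linear(vec, weight):
    """Apply a linear map given as a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def compatibility(query_feats, key_feats, w_query, w_key):
    """Scaled dot product between transformed global descriptors of two shapes."""
    q = linear(global_descriptor(query_feats), w_query)
    k = linear(global_descriptor(key_feats), w_key)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


feats = [[1.0, 0.0], [3.0, 2.0]]  # toy shape: 2 points, 2 channels
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(global_descriptor(feats))                          # [3.0, 2.0, 2.0, 1.0]
print(compatibility(feats, feats, identity, identity))   # 9.0
```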
Index  Layer  out 

1  ,  N 3 
2  EdgeConv(out(1), 3, 64) + GN  N 64 
3  EdgeConv(out(2), 64, 64) + GN  N 64 
4  EdgeConv(out(3), 64, 128) + GN  N 128 
5  EdgeConv(out(4), 128, 256) + GN  N 256 
6  CAT(out(2), out(3), out(4), out(5))  N 512 
7  N 64  
8  N 64  
9  N 128  
10  N 256  
11  CAT(out(7), out(8), out(9), out(10))  N 512 
12  N 64  
13  N 64  
14  N 128  
15  N 256  
16  CAT(out(12), out(13), out(14), out(15))  N 512 
17  1 1  
18  1 1  
19  SIMCSA, SIMSSA = Softmax(out(17), out(18))  2 1 
20  SIMCSA*out(11) + SIMSSA*out(16)  N 512 
21  CAT(out(6), out(20))  N 1024 
22  FC(out(21), 1024) + ReLU  N 1024 
23  FC(out(22), 512) + ReLU + Dropout(0.5)  N 512 
24  FC(out(23), 256) + ReLU + Dropout(0.5)  N 256 
25  FC(out(24), 128) + ReLU + Dropout(0.5)  N 128 
26  FC(out(25), )  N 
27  SemanticLabel=Softmax(out(26))  N 
Index  Layer  out 

1  N 3  
2  EdgeConv(out(1), 3, 64)  N 64 
3  EdgeConv(out(2), 64, 128)  N 128 
4  EdgeConv(out(3), 128, 128)  N 128 
5  EdgeConv(out(4), 128, 256)  N 256 
6  CAT(out(2), out(3), out(4), out(5))  N 512 
7  CONV1D(out(6), 1024)  N 1024 
8  AMP(out(7))  1 1024 
9  AAP(out(7))  1 1024 
10  CAT(out(8), out(9))  1 2048 
11  1 2048  
12  1 2048  
13  SIM(query, key) =  1 1 
Training details
Here we describe the training procedure for CrossShapeNet in more detail. For optimization we use the Adam optimizer [62] with learning rate 0.001 and . We start by training CrossShapeNet with a pretrained ShapeCompatibility submodule (pretrained on ModelNet40 for classification), keeping the ShapeCompatibility parameters frozen while training the rest of the network. Once the validation accuracy saturates, we load the best current model from our checkpoint and switch to training the ShapeCompatibility submodule, again until we observe saturation in the validation accuracy. We then update the shape collection graph (i.e., the compatible neighbors per shape). We keep training by alternating between these two phases: (a) training the CrossShapeNet layers (while keeping the rest frozen), and (b) training the ShapeCompatibility layers and updating the shape collection graph. Overall, we perform five such alternating training iterations: three for CrossShapeNet and two for ShapeCompatibilityNet. Figure 5 shows the evolution of the part IoU on the validation set for Clock (left) and Vase (right) in the case of “CrossShapeNetK5” during the training epochs of the above phases. Performance improves when the ShapeCompatibility network and the collection graph are updated.
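The alternation described above can be summarized with a small scheduling sketch. It only enumerates which submodule is trainable in each phase and when the shape collection graph is refreshed; the actual optimization is omitted. The function names and the `patience` criterion for detecting saturation are assumptions, since the text does not specify how saturation is measured.

```python
def alternating_schedule(num_rounds=5):
    """Yield (round, trainable module, frozen module, refresh shape graph afterwards)."""
    for r in range(num_rounds):
        if r % 2 == 0:
            # Phase (a): train the CrossShapeNet layers, compatibility net frozen.
            yield r, "CrossShapeNet", "ShapeCompatibilityNet", False
        else:
            # Phase (b): train the compatibility net, then update the shape graph.
            yield r, "ShapeCompatibilityNet", "CrossShapeNet", True


def has_saturated(val_history, patience=3):
    """True when the best validation score is at least `patience` evaluations old.

    The patience rule is a hypothetical stand-in for "validation accuracy saturates".
    """
    if len(val_history) <= patience:
        return False
    best_index = max(range(len(val_history)), key=val_history.__getitem__)
    return len(val_history) - 1 - best_index >= patience


for step in alternating_schedule():
    print(step)
```

With five rounds this yields three CrossShapeNet phases and two ShapeCompatibilityNet phases, matching the counts stated in the text.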
References
[1] Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 38(5) (2019)
[2] Liu, Y., Fan, B., Xiang, S., Pan, C.: Relation-shape convolutional neural network for point cloud analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 8895–8904
[3] Li, G., Muller, M., Thabet, A., Ghanem, B.: Deepgcns: Can gcns go as deep as cnns? In: IEEE International Conference on Computer Vision (ICCV). (2019) 9267–9276
[4] Jiang, L., Zhao, H., Liu, S., Shen, X., Fu, C.W., Jia, J.: Hierarchical point-edge interaction network for point cloud semantic segmentation. In: IEEE International Conference on Computer Vision (ICCV). (2019) 10433–10441
[5] Xu, Q., Sun, X., Wu, C.Y., Wang, P., Neumann, U.: Grid-gcn for fast and scalable point cloud learning. arXiv preprint arXiv:1912.02984 (2019)
[6] Wang, L., Huang, Y., Hou, Y., Zhang, S., Shan, J.: Graph attention convolution for point cloud semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 10296–10305
[7] Yunxiao, S., Haoyu, F., Jing, Z., Yi, F.: Pairwise attention encoding for point cloud feature learning. In: Proceedings of the International Conference on 3D Vision (3DV). (2019)
[8] Kim, V.G., Li, W., Mitra, N.J., Chaudhuri, S., DiVerdi, S., Funkhouser, T.: Learning part-based templates from large collections of 3d shapes. ACM Trans. Graph. 32(4) (2013)
[9] Huang, H., Kalogerakis, E., Marlin, B.: Analysis and synthesis of 3d shape families via deep-learned generative models of surfaces. Computer Graphics Forum 34(5) (2015)
[10] Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 909–918
[11] Qi, C.R., Su, H., Mo, K., Guibas, L.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 652–660
[12] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: International Conference on Neural Information Processing Systems (NeurIPS). (2017)
[13] Li, J., Chen, B., Lee, G.: So-net: Self-organizing network for point cloud analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018)
[14] Srivastava, N., Goh, H., Salakhutdinov, R.: Geometric capsule autoencoders for 3d point clouds. arXiv preprint arXiv:1912.03310 (2019)
[15] Le, E.T., Kokkinos, I., Mitra, N.J.: Going deeper with point networks. arXiv preprint arXiv:1907.00960 (2019)
[16] Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: IEEE International Conference on Computer Vision (ICCV). (2015) 945–953
[17] Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view cnns for object classification on 3d data. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 5648–5656
[18] Kalogerakis, E., Averkiou, M., Maji, S., Chaudhuri, S.: 3d shape segmentation with projective convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 3779–3788
[19] Huang, H., Kalogerakis, E., Chaudhuri, S., Ceylan, D., Kim, V.G., Yumer, E.: Learning local shape descriptors from part correspondences with multi-view convolutional networks. ACM Trans. Graph. 37(1) (2017)
[20] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 1912–1920
[21] Maturana, D., Scherer, S.: Voxnet: A 3d convolutional neural network for real-time object recognition. In: IEEE International Conference on Intelligent Robots and Systems (IROS). (2015) 922–928
[22] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 5828–5839
[23] Rethage, D., Wald, J., Sturm, J., Navab, N., Tombari, F.: Fully-convolutional point networks for large-scale point clouds. In: European Conference on Computer Vision (ECCV). (2018) 596–611
[24] Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 770–779
[25] Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel cnn for efficient 3d deep learning. In: Advances in Neural Information Processing Systems (NeurIPS). (2019) 963–973
[26] Riegler, G., Ulusoy, A.O., Geiger, A.: Octnet: Learning deep 3d representations at high resolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
[27] Klokov, R., Lempitsky, V.: Escape from cells: Deep KD-networks for the recognition of 3d point cloud models. In: IEEE International Conference on Computer Vision (ICCV). (2017) 863–872
[28] Wang, P.S., Liu, Y., Guo, Y.X., Sun, C.Y., Tong, X.: O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph. 36(4) (2017)
[29] Wang, P.S., Sun, C.Y., Liu, Y., Tong, X.: Adaptive O-CNN: A patch-based deep representation of 3d shapes. ACM Trans. Graph. 37(6) (2018)
[30] Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M.H., Kautz, J.: SPLATNet: Sparse lattice networks for point cloud processing. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018)
[31] Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: PointCNN: Convolution on X-transformed points. In: Advances in Neural Information Processing Systems (NeurIPS). (2018) 820–830
[32] Hua, B.S., Tran, M.K., Yeung, S.K.: Pointwise convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018) 984–993
[33] Xie, S., Liu, S., Chen, Z., Tu, Z.: Attentional shapecontextnet for point cloud recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018) 4606–4615
[34] Liu, Y., Fan, B., Meng, G., Lu, J., Xiang, S., Pan, C.: Densepoint: Learning densely contextual representation for efficient point cloud processing. In: IEEE International Conference on Computer Vision (ICCV). (2019) 5239–5248
[35] Groh, F., Wieschollek, P., Lensch, H.P.A.: Flex-convolution (million-scale point-cloud learning beyond grid-worlds). In: Asian Conference on Computer Vision (ACCV). (2018)
[36] Atzmon, M., Maron, H., Lipman, Y.: Point convolutional neural networks by extension operators. ACM Trans. Graph. 37(4) (2018)
[37] Hermosilla, P., Ritschel, T., Vázquez, P.P., Vinacua, A., Ropinski, T.: Monte carlo convolution for learning on non-uniformly sampled point clouds. ACM Trans. Graph. 37(6) (2018)
[38] Wang, S., Suo, S., Ma, W.C., Pokrovsky, A., Urtasun, R.: Deep parametric continuous convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018) 2589–2597
[39] Xu, Y., Fan, T., Xu, M., Zeng, L., Qiao, Y.: Spidercnn: Deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527 (2018)
[40] Wu, W., Qi, Z., Fuxin, L.: Pointconv: Deep convolutional networks on 3d point clouds. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 9621–9630
[41] Komarichev, A., Zhong, Z., Hua, J.: A-CNN: Annularly convolutional neural networks on point clouds. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 7421–7430
[42] Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.J.: KPConv: Flexible and deformable convolution for point clouds. In: IEEE International Conference on Computer Vision (ICCV). (2019) 6411–6420
[43] Shen, Y., Feng, C., Yang, Y., Tian, D.: Mining point cloud local structures by kernel correlation and graph pooling. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018) 4548–4557
[44] Lan, S., Yu, R., Yu, G., Davis, L.S.: Modeling local geometric structure of 3d point clouds using geo-cnn. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 998–1008
[45] Wang, C., Samari, B., Siddiqi, K.: Local spectral graph convolution for point set feature learning. arXiv preprint arXiv:1803.05827 (2018)
[46] Zhang, K., Hao, M., Wang, J., de Silva, C.W., Fu, C.: Linked Dynamic Graph CNN: Learning on point cloud via linking hierarchical features. arXiv preprint arXiv:1904.10014 (2019)
[47] Landrieu, L., Boussaha, M.: Point cloud oversegmentation with graph-structured deep metric learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019) 7440–7449
[48] Han, W., Wen, C., Wang, C., Li, X., Li, Q.: Point2Node: Correlation learning of dynamic-node for point cloud feature modeling. arXiv preprint arXiv:1912.10775 (2019)
[49] Yi, L., Su, H., Guo, X., Guibas, L.J.: SyncSpecCNN: Synchronized spectral cnn for 3d shape segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 2282–2290
[50] Boscaini, D., Masci, J., Melzi, S., Bronstein, M.M., Castellani, U., Vandergheynst, P.: Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. Computer Graphics Forum 34 (2015) 13–23
[51] Boscaini, D., Masci, J., Rodolà, E., Bronstein, M.: Learning shape correspondence with anisotropic convolutional neural networks. In: Advances in Neural Information Processing Systems (NeurIPS). (2016) 3189–3197
[52] Monti, F., Boscaini, D., Masci, J., Rodolà, E., Svoboda, J., Bronstein, M.M.: Geometric deep learning on graphs and manifolds using mixture model cnns. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 5115–5124
[53] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018) 7794–7803
[54] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: Criss-cross attention for semantic segmentation. In: IEEE International Conference on Computer Vision (ICCV). (2019) 603–612
[55] Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H.: GCNet: Non-local networks meet squeeze-excitation networks and beyond. In: IEEE International Conference on Computer Vision (ICCV) Workshops. (2019)
[56] Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: European Conference on Computer Vision (ECCV). (2018) 201–216
[57] Hou, R., Chang, H., Bingpeng, M., Shan, S., Chen, X.: Cross attention network for few-shot classification. In: Advances in Neural Information Processing Systems (NeurIPS). (2019) 4005–4016
[58] Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: Superglue: Learning feature matching with graph neural networks. arXiv preprint arXiv:1911.11763 (2019)
[59] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS). (2017) 5998–6008
[60] Ram, P., Gray, A.G.: Maximum inner-product search using tree data-structures. arXiv preprint arXiv:1202.6101 (2012)
[61] Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vision (ECCV). (2018) 3–19
[62] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)