Learning to Group and Label Fine-Grained Shape Components
Abstract.
A majority of stock 3D models in modern shape repositories are assembled from many fine-grained components. The main cause of this data form is the component-wise modeling process widely practiced by human modelers. These modeling components thus inherently reflect some function-based shape decomposition the artist had in mind during modeling. On the other hand, modeling components represent an over-segmentation, since a functional part is usually modeled as a multi-component assembly. Based on these observations, we advocate that labeled segmentation of stock 3D models should not overlook the modeling components, and we propose a learning solution to the grouping and labeling of the fine-grained components. However, directly characterizing the shape of individual components for the purpose of labeling is unreliable, since they can be arbitrarily tiny and semantically meaningless. We propose to generate part hypotheses from the components based on a hierarchical grouping strategy, and to perform labeling on those part groups instead of directly on the components. Part hypotheses are mid-level elements which are more likely to carry semantic information. A multi-scale 3D convolutional neural network is trained to extract context-aware features for the hypotheses. To accomplish a labeled segmentation of the whole shape, we formulate higher-order conditional random fields (CRFs) to infer an optimal label assignment for all components. Extensive experiments demonstrate that our method achieves robust labeling results on raw 3D models from public shape repositories. Our work also contributes the first benchmark for component-wise labeling.
1. Introduction
Semantic or labeled segmentation of 3D shapes has seen a significant performance boost in recent years, benefiting from advances in machine learning techniques [Kalogerakis et al., 2010; Hu et al., 2012], and more recently in deep neural networks [Kalogerakis et al., 2017; Su et al., 2017]. Existing methods have so far dealt with manifold meshes, point clouds, or volumes. They are, however, not specifically designed to handle most stock 3D models, which typically assemble up to hundreds of highly fine-grained components (Figure 1). Multi-component assembly is the most commonly seen data form in modern 3D shape repositories (e.g., Trimble 3D Warehouse [Tri, 2017] and ShapeNet [Chang et al., 2015]). See Figure 2 (left) for the statistics of component counts in ShapeNetCore.
Multi-view projective segmentation [Wang et al., 2013; Kalogerakis et al., 2017] is perhaps the most feasible of the existing techniques for handling multi-component shapes. View-based methods are representation independent, making them applicable to non-manifold models. However, a major drawback of this approach is that it cannot handle shapes with severe self-occlusion. Components hidden beneath the surface are invisible from any view and thus cannot be labeled. Figure 3(a) shows such an example: the seats in the car are completely occluded by the car shell and thus cannot be segmented or labeled correctly by view-based methods.
Most off-the-shelf 3D models are created by human modelers in a component-by-component fashion. Generally, human modelers tend to have in mind a meaningful decomposition of the target object before starting. Such a decomposition is inherently related to functionality, mimicking the actual production of man-made objects, e.g., a car is decomposed into shell, hood, wheels, seats, etc. Therefore, we advocate that the segmentation of such models should not overlook the components that come with the models. Meanwhile, these components usually represent an over-segmentation: a functional part might be modeled as an assembly of multiple sub-parts. A natural solution to semantic segmentation thus seems to be a labeled grouping of the modeling components.
A few facts about the components of stock models, however, make their grouping and labeling especially difficult. First, the decomposition of these models is often highly fine-grained; see the tiny components contained in the bicycle model in Figure 1. Taking the car models in ShapeNetCore for example, about contains over components. Second, the size of components varies significantly; see Figure 2 (right). Third, different modelers may have different opinions about shape composition, making the components of the same functional part highly inconsistent across different shapes. The example in Figure 3(b) shows that the wheel parts from different vehicle models have very different compositions. For these reasons, it is very unreliable to directly characterize the shape of individual components for the purpose of labeling.
These facts motivate us to consider larger and more meaningful elements in order to achieve a robust semantic labeling of fine-grained components. In particular, we propose to generate part hypotheses from the components, representing potential functional or semantic parts of the object. This is achieved by a series of effective grouping strategies, which prove robust under extensive evaluation. Our task then becomes labeling the true part hypotheses while pruning the false ones, instead of directly labeling the individual components. Working with part hypotheses enables us to learn a more informative shape representation, based on which reliable labeling can be conducted. A part hypothesis is similar in spirit to a mid-level patch for image understanding, which admits more discriminative descriptors than feature points [Singh et al., 2012].
To achieve a powerful part hypotheses labeling, we adopt 3D Convolutional Neural Networks (CNN) to extract features from the volumetric representation of part hypotheses. In order to learn features that capture not only local part geometry but also global, contextual information, we design a network that takes two scales of 3D volume as input. The local scale encodes the part hypothesis of interest itself, through feature extraction over the voxelization of the part within its bounding box. The global volume takes the bounding box of the whole shape as input, and encodes the context with two channels contrasting the volume occupancy of the part hypothesis itself and that of the remaining parts. The network outputs the labeling probabilities of the part hypothesis over different part categories, which are used for final labeled segmentation.
To accomplish a labeled segmentation of the whole shape, we formulate higher-order Conditional Random Fields (CRFs) to infer an optimal label assignment for each component. Our CRF-based model achieves highly accurate labeling, while saving the effort of preparing a large amount of high-order relational data for training a deep model. Consequently, our design choice, combining CNN-based part hypothesis features and higher-order CRFs, achieves a good balance between model generality and complexity.
We validate our approach on our multi-component labeling (MCL) benchmark dataset. The multi-component 3D models are collected from both ShapeNet and 3D Warehouse, with all components manually labeled. Our method achieves significantly higher accuracy in grouping and labeling highly fine-grained components than alternative approaches. We also demonstrate how our method can be applied to fine-grained part correspondence for 3D shapes, achieving state-of-the-art results.
The main contributions of our paper include:

We study a new problem of labeled segmentation of stock 3D models based on the pre-existing, highly fine-grained components, and approach the problem with a novel solution of part hypothesis generation and characterization.

We propose a multi-scale 3D CNN for encoding both local and contextual information for part hypothesis labeling, as well as a CRF-based formulation for component labeling.

We build the first benchmark for multi-component labeling with component-wise ground-truth labels and conduct extensive evaluation over the benchmark.
2. Related Work
Shape segmentation and labeling is one of the most classical and long-standing problems in shape analysis, and numerous methods have been proposed. Early studies [Katz and Tal, 2003; Huang et al., 2009; Shapira et al., 2010; Zhang et al., 2012; Au et al., 2012] mostly utilize hand-crafted geometric features. A single geometric feature usually captures very limited aspects of shape decomposition, so a more widely practiced approach is to combine multiple features [Kalogerakis et al., 2010].
To tackle the limitations of hand-crafted features, data-driven feature learning methods have been proposed [Xu et al., 2016]. Guo et al. [2015] learned a compact representation of triangles for 3D mesh labeling by non-linearly combining and hierarchically compressing various geometric features with deep CNNs. Xie et al. [2014] proposed a fast method for 3D mesh segmentation and labeling based on the Extreme Learning Machine. Yi et al. [2017b] proposed a method, named SyncSpecCNN, to label the semantic parts of a 3D mesh. SyncSpecCNN trains vertex functions using CNNs and conducts spectral analysis to enable kernel weight sharing by using localized information of mesh graphs. These methods achieve promising performance, but they largely focus on manifold and/or watertight surface meshes and are not suited for raw 3D models from modern shape repositories.
Recently, Kalogerakis et al. [2017] proposed a deep architecture for segmenting and labeling semantic parts of 3D shapes by combining multi-view fully convolutional networks and surface-based CRFs. Projection-based methods [Wang et al., 2013; Kalogerakis et al., 2017] are suitable for imperfect (e.g., incomplete, self-intersecting, and noisy) 3D shapes, but inherently have a hard time on shapes with severe self-occlusion. Su et al. [2017] designed a novel type of neural network, named PointNet, for directly segmenting and labeling 3D point clouds while respecting permutation invariance, obtaining state-of-the-art performance on point data.
Several unsupervised or semi-supervised methods have been proposed for the co-segmentation and/or co-labeling of a collection of 3D shapes belonging to the same category [Xu et al., 2010; Huang et al., 2011; Sidi et al., 2011; Wang et al., 2012; Hu et al., 2012; Lv et al., 2012; van Kaick et al., 2013]. Most of these methods are based on an over-segmentation of the input shapes. A grouping process is then conducted to form a semantic segmentation and labeling. Such initial over-segments (e.g., superfaces) are analogous to our 'part hypotheses'. However, they are still too low-level to capture meaningful part information. Our method benefits from the pre-existing fine-grained components, which make part hypothesis based analysis possible.
Few works have studied the semantic segmentation of multi-component models. Liu et al. [2014] proposed to label and organize 3D scenes obtained from the Trimble 3D Warehouse into consistent hierarchies capturing semantic and functional substructures. The labeling is based on an over-segmentation of the 3D input and is guided by a learned probabilistic grammar. Yi et al. [2017a] proposed a method for converting the scene graph of a multi-component shape into segmented parts by learning a category-specific canonical part hierarchy. Their method achieves fine-grained component labeling, but scene graphs are not always available.
3. Method
Please refer to Figure 4 for an overview of our algorithm pipeline. In what follows, we describe the three algorithmic components: part hypothesis generation, part hypothesis classification and scoring, and part composition inference and component labeling.
3.1. Generating part hypothesis
Part hypothesis
In our work, a semantic part, or part for short, refers to a semantically independent or functionally complete group of components. A part hypothesis is a component group which potentially represents a semantic part. When searching for a part hypothesis, we follow two principles. First, a part hypothesis should cover as many components of the corresponding ground-truth part as possible. Second, the component coverage of a part hypothesis should be conservative, meaning that a hypothesis with missing components is preferred over one that encompasses components from different semantic parts.
Grouping strategy
It is a non-trivial task to generate part hypotheses meeting the above requirements exactly. It is very likely that no single criterion is optimal for generating hypotheses for every semantic part from a set of components. For example, the many components of a car wheel can seemingly be grouped based on a compactness criterion. For a bicycle chain, however, the tiny chain links are not compactly stacked at all, so size-based grouping might be more appropriate. Therefore, we design a grouping strategy encompassing three heuristic criteria, which are intuitively interpretable and computationally efficient. The grouping for each criterion is performed in a bottom-up fashion, based on a nested hierarchy. After that, a hypothesis selection process is conducted to ensure a conservative hypothesis coverage. The quality of the generated part hypotheses is evaluated in Section 4.4.
Through a statistical analysis (Figure 5), we found that most semantic parts are spatially compact, such as a door or a wheel of a car. We thus define a criterion called Center Distance, denoted by , to measure the compactness between two components and . It measures the distance between the barycenters of convex hull of two components, and encourages grouping of components which are spatially close to each other.
The second criterion, sharing the similar intuition as center distance, imposes a stronger test on compactness. This is motivated by the fact that the components in a functional part are typically tightly assembled. The Geometric Contact criterion, denoted as , prioritizes the grouping of components with large area of geometric contact. Let and be the volume of component and , respectively, and be the contact volume. The criterion is defined as the maximum of the ratio between contact volume and component volume:
Here, the volume of a component can be computed by counting the voxels occupied by the component in a global voxelization of the entire shape. The contact volume between two components can be computed as the number of overlapping voxels of the two components in the global voxelization.
The semantic parts of a 3D model can be of arbitrary size, ranging from a rearview mirror to the entire cab for a car; see the supplementary material. Thus for each part category, we sample a set of candidate proposals with varying sizes, to avoid missing the best one. We design a third criterion Group Size, denoted by , as the occupancy rate of the joint volume of component and over the volume of the whole shape. This criterion is used to control the grouping, sampling groups first in small size and then to large.
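The two voxel-based criteria above can be illustrated with a minimal sketch. It assumes that each component is already given as a set of occupied voxel indices in the global voxelization, that the shape volume is the number of voxels occupied by any component, and that no component is empty; the function names are hypothetical and not part of our implementation.

```python
def geometric_contact(vox_a, vox_b):
    """Max ratio of contact volume to each component's volume (voxel counts)."""
    contact = len(vox_a & vox_b)  # overlapping voxels of the two components
    return max(contact / len(vox_a), contact / len(vox_b))


def group_size(vox_a, vox_b, shape_voxels):
    """Occupancy rate of the joint volume over the volume of the whole shape."""
    return len(vox_a | vox_b) / len(shape_voxels)


# Toy example: two components sharing one voxel inside a five-voxel shape.
a = {(0, 0, 0), (1, 0, 0)}
b = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}
shape = a | b | {(4, 0, 0)}
contact = geometric_contact(a, b)
size = group_size(a, b, shape)
```

Representing components as voxel sets keeps both criteria to a few set operations, at the cost of accuracy that depends on the voxel resolution.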
Hierarchical hypothesis sampling
We employ a hierarchical aggregation algorithm to generate part hypotheses. This is motivated by the fact that most off-the-shelf 3D models are assembled from components in a hierarchical manner. Given a shape, the sampling process starts from the input set of components and groups them in a greedy, bottom-up manner. At each step, the pair of adjacent components with the smallest grouping criterion measure is grouped into a new node. The process is repeated until reaching the root of the hierarchy, and it is performed for each criterion separately. Figure 6 illustrates the grouping process for each grouping criterion. Nodes shaded in grey in the hierarchies represent sampled part hypotheses. The top few groups in each hierarchy are then selected to form the candidate set for the given shape, as discussed next.
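The greedy bottom-up grouping can be sketched as follows. This is an illustrative sketch under stated assumptions, not our implementation: component adjacency is assumed to be given as a set of unordered pairs, the criterion is passed in as a generic `cost` function where smaller values are merged first, and two groups are considered adjacent if any of their components are. All names are hypothetical.

```python
import itertools


def build_hierarchy(components, adjacent, cost):
    """Greedy bottom-up grouping; returns merged groups in grouping order.

    components: list of hashable component ids
    adjacent:   set of frozenset({a, b}) pairs of adjacent components
    cost:       function(group_a, group_b) -> float, smallest is merged first
    """
    groups = [frozenset([c]) for c in components]
    merges = []

    def touching(g1, g2):
        return any(frozenset([a, b]) in adjacent for a in g1 for b in g2)

    while len(groups) > 1:
        pairs = [(g1, g2) for g1, g2 in itertools.combinations(groups, 2)
                 if touching(g1, g2)]
        if not pairs:  # disconnected shape: no adjacent groups remain
            break
        g1, g2 = min(pairs, key=lambda p: cost(*p))
        merged = g1 | g2
        groups = [g for g in groups if g not in (g1, g2)] + [merged]
        merges.append(merged)
    return merges


# Toy example: four components on a line, grouped by distance between centers.
positions = {0: 0.0, 1: 1.0, 2: 5.0, 3: 5.5}
adjacent = {frozenset([0, 1]), frozenset([1, 2]), frozenset([2, 3])}


def center_gap(g1, g2):
    mean = lambda g: sum(positions[c] for c in g) / len(g)
    return abs(mean(g1) - mean(g2))


merges = build_hierarchy([0, 1, 2, 3], adjacent, center_gap)
```

Each entry of `merges` corresponds to an internal node of the hierarchy, so running the function once per criterion yields one hierarchy per criterion.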
Selection of part hypotheses
We first sort the hypotheses, corresponding to nodes in a hierarchy, based on their grouping order. Higher-level nodes imply larger coverage, while lower ones correspond to smaller regions. To prevent the selection from overly favoring hypotheses with large coverage, we introduce random factors into the selection process, in a similar spirit to [Carreira and Sminchisescu, 2012; Manen et al., 2014; van de Sande et al., 2011]. In particular, the initial sorting is perturbed by multiplying the sorted indices with a random number in , and then re-sorting based on the resulting numbers. Finally, the top hypotheses are selected for each hierarchy, thus yielding hypotheses in total. In Figure 7, we visualize a few part hypotheses corresponding to some semantic parts for a bicycle model.
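The randomized selection can be sketched as below. The sketch assumes that the groups of one hierarchy are already sorted with the highest-priority group first, and that the random factor is drawn uniformly from [0, 1); the interval and the number of hypotheses kept per hierarchy are placeholders here, and the function name is hypothetical.

```python
import random


def select_hypotheses(ordered_groups, k, rng):
    """Perturb the grouping-order ranking and keep the top k groups.

    ordered_groups: groups of one hierarchy, sorted by grouping order
    k:              number of hypotheses to keep for this hierarchy
    rng:            random.Random instance, passed in for reproducibility
    """
    # Ranks start at 1 so that the first group is perturbed as well.
    keyed = [((rank + 1) * rng.random(), rank, group)
             for rank, group in enumerate(ordered_groups)]
    keyed.sort(key=lambda t: t[:2])  # re-sort by the perturbed rank
    return [group for _, _, group in keyed[:k]]


rng = random.Random(0)
picked = select_hypotheses(list(range(10)), 3, rng)
```

Because the perturbed key is the rank times a random factor, highly ranked groups remain more likely to be kept, while lower-ranked groups still have a chance of being selected.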
The Intersection over Union (IoU) based recall (the recall rate for a given IoU threshold w.r.t. the ground truth) of the hypothesis selection is given in Figure 8. It shows that our hierarchical sampling and selection method is quite effective in capturing the potential semantic parts, even for complicated structures such as chairs, bicycles and helicopters. Note that although the recall rates drop significantly around the IoU threshold of for several categories, this does not hurt the performance, since the recall rates for IoU of are already high enough for the following CRF-based labeling algorithm to perform well. Results in Figure 12 show that the labeling accuracy is stable with the number of sampled part hypotheses and proposals (our default choice) are sufficient for all categories.
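For concreteness, IoU-based recall can be computed as in the following sketch. It assumes that ground-truth parts and hypotheses are both represented as sets (of voxels or of component ids) and that a ground-truth part counts as recalled when at least one hypothesis reaches the IoU threshold; the helper names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two sets (of voxels or component ids)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recall_at_iou(gt_parts, hypotheses, threshold):
    """Fraction of ground-truth parts matched by a hypothesis with IoU >= threshold."""
    hits = sum(1 for gt in gt_parts
               if any(iou(gt, h) >= threshold for h in hypotheses))
    return hits / len(gt_parts)


# Toy example with parts given as sets of component ids.
gt_parts = [{0, 1, 2}, {3, 4}]
hypotheses = [{0, 1}, {3, 4}, {2, 3}]
recall_loose = recall_at_iou(gt_parts, hypotheses, 0.5)
recall_strict = recall_at_iou(gt_parts, hypotheses, 0.9)
```

In the toy example the first part is only partially covered (IoU 2/3), so it is recalled at the looser threshold but not at the stricter one.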
Remarks on design choice
For part hypothesis generation, we opted for combinatorial search with hierarchical guidance rather than a learning-based approach. This is because the significantly varying number and size of components within a semantic part make it extremely difficult for, e.g., a CNN model, to capture the shape geometry with a fixed input resolution. Taking the part hypotheses in a bicycle model (Figure 7) for example, the component size and count differ greatly from part to part. On the other hand, a proper resolution of the data representation for a CNN is unknown before the part hypothesis is extracted, which is a chicken-and-egg problem. We have implemented a CNN-based method as a baseline and compared against it, demonstrating a clear advantage of our approach (Sections 4.2 and 4.4).
3.2. Classifying and Scoring of Part hypotheses
We train a neural network to classify a part hypothesis and produce a score representing the confidence of the hypothesis being an independent semantic part. To achieve that, we first build a training dataset of multi-component 3D models with component-wise labels. We then design a multi-scale convolutional neural network (CNN), which learns a feature representation capturing not only local part geometry but also global context.
Training data
The multi-component 3D models used for training are collected from both ShapeNet [Chang et al., 2015] and 3D Warehouse [Tri, 2017]. The data comes from the training part of our multi-component labeling benchmark (see Section 4.1). Each model in the training set has a component-wise labeling, based on the semantic labels defined with WordNet. An overview of the human-labeled models from the dataset can be found in the supplementary material. The training set contains eight object categories and two scene categories, whose statistics are detailed in Table 1.
Input data representation
To train a CNN model for part hypothesis labeling, we opt for a volumetric representation of part hypotheses as input, similar to [Wu et al., 2015]. To achieve multi-scale feature learning, we represent each hypothesis with three input volumes at two scales: a local scale, based on a voxelization of the hypothesis within its own bounding box, and a global scale, which takes the volume of the bounding box of the entire shape and contributes two channels. One channel encodes the volume occupancy of the part hypothesis itself and the other accounts for the context based on the occupancy of the remaining parts. For each scale, the volume resolution is fixed to . To avoid global alignment among all shapes, we opt for training with many possible orientations of each shape. In practice, we use the upright orientation of each shape and enumerate its four canonical orientations (Manhattan frames).
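The three input volumes can be sketched as below. The sketch assumes that the hypothesis and the remaining parts are available as surface point samples, represents each volume as a set of occupied cells rather than a dense tensor, and leaves the resolution as a parameter; the function names are hypothetical and the enumeration of orientations is omitted.

```python
def voxelize(points, lo, hi, res):
    """Map 3D points into a res^3 grid spanning the box [lo, hi]; return occupied cells."""
    cells = set()
    for p in points:
        idx = []
        for x, a, b in zip(p, lo, hi):
            t = (x - a) / (b - a) if b > a else 0.0  # normalized coordinate in [0, 1]
            idx.append(min(res - 1, max(0, int(t * res))))
        cells.add(tuple(idx))
    return cells


def bbox(points):
    """Axis-aligned bounding box of a list of 3D points."""
    lo = tuple(min(p[i] for p in points) for i in range(3))
    hi = tuple(max(p[i] for p in points) for i in range(3))
    return lo, hi


def hypothesis_inputs(part_points, other_points, res):
    """Return (local, global_part, global_context) occupancy sets."""
    part_lo, part_hi = bbox(part_points)
    shape_lo, shape_hi = bbox(part_points + other_points)
    local = voxelize(part_points, part_lo, part_hi, res)              # part in its own box
    global_part = voxelize(part_points, shape_lo, shape_hi, res)      # part in the shape box
    global_context = voxelize(other_points, shape_lo, shape_hi, res)  # remaining parts
    return local, global_part, global_context


local, g_part, g_context = hypothesis_inputs(
    [(0, 0, 0), (1, 1, 1)], [(4, 4, 4)], res=4)
```

The local volume stretches the hypothesis to fill the grid, whereas the two global channels preserve its size and position relative to the rest of the shape.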
Data augmentation for balanced training
The hierarchical grouping of part hypotheses could make the training data unbalanced: insufficient data is sampled for semantic categories containing a small number of components (e.g., the rear-view mirrors of cars), which would leave our CNN model inadequately trained for these categories. To cope with this issue, we synthesize more training data for the categories with insufficient instances, based on the ground-truth semantic parts in the training data. Specifically, we pursue two ways of data augmentation.

Component deletion. Given a ground-truth semantic part, we randomly delete a few components and use the incomplete part as a training example. We typically remove up to components.

Component insertion. Given a ground-truth semantic part, we randomly insert a few components from the neighboring parts to form a training example. We stipulate that the newly added components do not exceed of original ones.
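The two augmentation operations can be sketched as follows. The limits on how many components are removed or inserted are exposed as parameters (`max_remove`, `max_ratio`) because the sketch does not fix their values; these parameter names and the function names are hypothetical.

```python
import random


def delete_components(part, max_remove, rng):
    """Drop between 1 and max_remove components, always keeping at least one."""
    part = list(part)
    if len(part) < 2:
        return part  # nothing can be removed without emptying the part
    n = rng.randint(1, min(max_remove, len(part) - 1))
    removed = set(rng.sample(part, n))
    return [c for c in part if c not in removed]


def insert_components(part, neighbors, max_ratio, rng):
    """Add components from neighboring parts, at most max_ratio * len(part) of them."""
    part = list(part)
    limit = min(len(neighbors), int(max_ratio * len(part)))
    n = rng.randint(1, limit) if limit >= 1 else 0
    return part + rng.sample(list(neighbors), n)


rng = random.Random(0)
part = ["a", "b", "c", "d"]
smaller = delete_components(part, 2, rng)
larger = insert_components(part, ["x", "y", "z"], 0.5, rng)
```

Both functions return a new component list, so a ground-truth part can be reused to synthesize several training examples.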
Groundtruth labels and scores
We next compute a ground-truth part label and confidence score for each training part hypothesis, used to train our network for both label prediction and score regression. For a given part hypothesis, if its components labeled with a certain category occupy over of the global voxelization of the entire shape, it is treated as a positive example for that category, and negative otherwise. For each hypothesis, we first compute its 3D Intersection over Union (IoU) against each ground-truth semantic part of the shape, in a global voxelization of . The highest IoU is set as its confidence score. This score measures the confidence of a hypothesis being an independent semantic part, and it is utilized in the final label inference in Section 3.3.
Network architecture
We design a multi-scale convolutional neural network (CNN), whose architecture is given in Figure 9. The network has three towers, taking the inputs corresponding to the local volume and the two global channels mentioned above. We refer to these towers as local, global and contextual, respectively. The feature maps output by the three towers are concatenated into one feature vector and then fed into a few fully connected layers, yielding a feature vector. The final fully connected layer predicts a label and regresses a score for the input part hypothesis. In particular, our network produces a probability distribution over part labels, , with label being null label. An unrecognized part is assigned with a null label. The second output is confidence score . We use a joint loss for both hypothesis classification and score regression:
(1) 
where is the cross-entropy loss for label , and is the smooth loss.
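A joint loss of this kind can be sketched as below. The sketch makes two assumptions that should be read as such: the regression term is taken to be the common smooth L1 penalty, and the two terms are combined as a weighted sum with a hypothetical `weight` parameter defaulting to 1.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def joint_loss(probs, label, score, target_score, weight=1.0):
    """Cross-entropy on the true label plus a weighted score-regression term."""
    cls = -math.log(probs[label])            # classification term
    reg = smooth_l1(score - target_score)    # score regression term (assumed form)
    return cls + weight * reg


loss = joint_loss([0.5, 0.5], 0, 0.8, 0.6)
```

The classification term penalizes low probability on the ground-truth label, while the regression term pulls the predicted confidence toward the IoU-based ground-truth score.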
Table 1. Dataset statistics (rows 1-5) and labeling accuracy in average IoU (rows 6-20).

Row  Statistic / method            Vehicle  Bicycle  Chair    Cabinet  Plane    Lamp     Motor    Helicopter  Living room  Office
1    # Avg. components             649      572      31       53       111      17       188      178         197          276
2    # Semantic labels             9        10       11       3        10       3        9        3           8            7
3    # Part hypotheses ()          1000     1000     200      200      1000     200      1000     1000        200          200
4    # Train / # Test              26 / 83  30 / 68  33 / 81  25 / 57  17 / 78  30 / 70  22 / 87  21 / 84     30 / 72      30 / 72
5    # Training hypo.              23787    28947    2149     4666     4812     1662     12800    8263        7090         14475
6    Baseline (Random Forest)      54.7     58.9     62.4     65.9     53.5     63.3     65.9     52.8        47.7         68.5
7    Baseline (CNN Classifier)     48.9     63.8     70.75    63.3     68.9     81.2     67.4     78.5        51.2         63.9
8    Baseline (CNN Hypo. Gen.)     56.3     51.9     68.5     45.7     58.5     71.1     53.1     72.2        58.6         69.1
9    PointNet [Su et al., 2017]    24.3     30.6     68.6     21.0     47.2     46.3     35.8     32.6        -            -
10   PointNet++ [Qi et al., 2017]  51.7     53.8     69.3     62.0     53.9     79.8     62.2     79.3        -            -
11   Guo et al. [2015]             27.1     25.2     34.2     68.8     38.6     79.1     41.6     80.1        33.7         28.5
12   Yi et al. [2017a]             65.2     63.0     61.9     70.6     59.3     82.2     67.5     78.9        56.6         68.6
13   Ours (w/o score)              71.5     66.8     72.5     76.5     71.4     87.6     70.7     81.2        63.3         60.1
14   Ours (local only)             50.4     52.4     60.4     68.6     61.3     73.5     60.4     78.5        62.7         54.8
15   Ours (local+global)           69.2     67.3     68.6     75.4     69.1     79.2     67.2     82.6        68.3         76.4
16   Ours ()                       52.0     43.2     63.5     62.0     47.6     76.5     41.7     42.4        54.6         70.7
17   Ours ()                       56.5     49.9     67.0     66.6     55.4     84.0     51.7     43.4        63.1         70.1
18   Ours ()                       59.3     54.9     70.5     69.6     59.8     86.3     55.3     50.7        64.7         68.9
19   Ours ()                       62.0     61.9     72.6     74.1     68.6     86.9     62.4     75.6        66.6         66.1
20   Ours ()                       73.7     68.1     74.3     78.7     76.5     88.3     71.7     83.3        66.1         65.4
3.3. Composite Inference and Labeling
Given the confidence scores of the sampled part hypotheses, the final stage of our method is to infer an optimal label assignment for each component. Consider a multi-component 3D model, denoted by , which comprises a set of components . Each component is associated with a random variable which takes a value from the part label set . Let denote the set of all part hypotheses. A part hypothesis is denoted by a set of components, , and its labeling is represented by a vector of random variables , with being the label assignment for component . A possible label assignment to all components, denoted by , is called a labeling for model .
We construct a higher-order conditional random field (CRF) to find the optimal labeling for all components, based on the part hypothesis analysis from the previous steps:
(2) 
where the first term is the unary potential for each component and the second term is the higher-order consistency potential defined over each hypothesis. The parameter is used to tune the importance of the two terms. We set in all our experiments. The CRF-based labeling is illustrated in Figure 10.
Unary potential
Let be the set of part hypotheses containing component . The unary potential is defined as:
(3) 
where is the probability of taking the label , and is defined as:
(4) 
where is the classification probability of hypothesis against label , output of our hypothesis classification network. is the top number of part hypotheses selected for computing the probability, based on the regressed confidence score for . is the confidence score for , regressed by our network. is a weight computed as the ratio between the volume of component and that of hypothesis .
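One possible reading of the definitions above is sketched below; it is an illustration under stated assumptions rather than the exact form of Equations (3) and (4). The sketch assumes that the label probability of a component is the average, over the top-k most confident hypotheses containing it, of the product of volume weight, confidence score and classification probability, and that the unary potential is the negative logarithm of that probability. The dictionary keys, the `eps` guard and the function names are hypothetical.

```python
import math


def label_probability(hyps, label, k):
    """Combine the k most confident hypotheses that contain a component.

    hyps: non-empty list of dicts with keys
          'probs'  - classification probabilities over labels,
          'score'  - regressed confidence score,
          'weight' - volume(component) / volume(hypothesis).
    """
    top = sorted(hyps, key=lambda h: h['score'], reverse=True)[:k]
    total = sum(h['weight'] * h['score'] * h['probs'][label] for h in top)
    return total / len(top)


def unary_potential(hyps, label, k, eps=1e-9):
    """Negative log of the combined probability (assumed form)."""
    return -math.log(label_probability(hyps, label, k) + eps)


hyps = [
    {'probs': [0.9, 0.1], 'score': 0.8, 'weight': 0.5},
    {'probs': [0.2, 0.8], 'score': 0.3, 'weight': 1.0},
    {'probs': [0.6, 0.4], 'score': 0.5, 'weight': 0.5},
]
p0 = label_probability(hyps, 0, k=2)
```

Restricting the sum to the most confident hypotheses keeps low-scoring groupings, which are unlikely to be true semantic parts, from dominating the label of a component.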
Higher order consistency potential
The goal of our CRF-based labeling is to resolve the inconsistency between different part hypotheses and compute a consistent component-wise labeling, resulting in a non-overlapping partition of all components. To this end, we design a higher-order consistency potential [Kohli et al., 2008; Park and Gould, 2003], based on the label purity of part hypotheses:
(5) 
where . is the number of components constituting part hypothesis . counts the number of random variables corresponding to hypothesis which takes label . is the truncation parameter which controls the rigidity of the higher order consistency potential, and is set to in all our experiments. This means that up to of ’s components can take an arbitrary label. , with being the label purity of a part hypothesis. The purity can be computed as the entropy of the classification probability output of our network:
(6) 
where is the classification probability of hypothesis against label output, again, by our network.
The consistency potential encourages components belonging to one part hypothesis to take the same label. However, it does not impose a hard constraint on label consistency by allowing a portion of the components within a part hypothesis to be labeled freely. This is achieved by the linear truncated cost over the number of inconsistent labels. This mechanism enables the components within a part hypothesis to be assigned to different labels, so that an optimal label assignment to all components could be found through compromising among all part hypotheses. The objective in Equation (2) can be efficiently optimized with the alphabeta moving algorithm [Kohli et al., 2008].
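The linear truncated cost can be illustrated with the sketch below, written in the style of the robust higher-order potentials of [Kohli et al., 2008]. It assumes that the number of inconsistent labels is counted against the dominant label of the hypothesis, that the maximum penalty `gamma` is supplied by the caller (for instance derived from the label purity), and that the truncation is expressed as a fraction `trunc_ratio` of the hypothesis size. These names are hypothetical, and the entropy helper only shows how a purity measure could be obtained.

```python
import math
from collections import Counter


def label_entropy(probs):
    """Shannon entropy of a hypothesis' classification distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def consistency_potential(labels, gamma, trunc_ratio):
    """Truncated linear cost on the number of components off the dominant label.

    labels:      labels currently assigned to the components of one hypothesis
    gamma:       maximum penalty for this hypothesis (assumed to be given)
    trunc_ratio: fraction of components that may deviate before the cost saturates
    """
    n = len(labels)
    inconsistent = n - max(Counter(labels).values())
    q = trunc_ratio * n
    if q <= 0:  # no tolerance: any deviation costs the full penalty
        return 0.0 if inconsistent == 0 else gamma
    return min(gamma, inconsistent * gamma / q)


mild = consistency_potential(['wheel'] * 8 + ['frame'] * 2, 1.0, 0.4)
severe = consistency_potential(['wheel'] * 5 + ['frame'] * 5, 1.0, 0.4)
```

The cost grows linearly with the number of deviating components and then saturates, so a hypothesis that is clearly violated stops influencing the labeling any further.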
An alternative approach to CRF-based labeling would be to formulate it as a deep learning model. The combination of CRFs and deep neural networks has shown promising results on semantic segmentation of 2D images [Zheng et al., 2015]. In the context of 3D component grouping and labeling, however, training such a deep model requires a large amount of relational data between different components, which is highly laborious to prepare. We believe our solution achieves a good balance between model generality and complexity.
4. Results and Evaluations
4.1. Multi-Component Labeling benchmark
To facilitate quantitative evaluation, we construct the first benchmark dataset with human-annotated, component-wise labels, named the multi-component labeling benchmark, or MCL benchmark for short. The multi-component 3D models are collected from ShapeNet [Chang et al., 2015] and 3D Warehouse [Tri, 2017], in which most 3D models are in the form of multi-component assemblies. We manually annotate each model in our dataset by assigning a semantic category to each component, using our interactive annotation tool. The annotation tool is elaborated in the supplementary material. The semantic part categories are defined based on WordNet and are summarized, with an overview of the benchmark dataset, in the supplementary material. Some statistics of the dataset are also given therein.
Rows 1 and 2 of Table 1 provide a summary and detailed statistics of our MCL benchmark dataset. For each category, about models are used for training, and the remaining for testing. This training/testing split is fixed for all subsequent experiments. A few metrics on segmentation accuracy are defined to support quantitative evaluation of component labeling; see the following subsections for details. We believe this benchmark will benefit future research on component-wise shape analysis and data-driven shape modeling [Sung et al., 2017; Li et al., 2017].
4.2. Labeling performance
We evaluate our semantic labeling on our MCL benchmark. The performance is measured by average Intersection over Union (avg IoU). The results, reported in the last row of Table 1, show that our approach achieves the best performance in most categories. Figure 16 shows the labeling results visually. Our approach is able to produce robust labeling for fine-grained components with complex structure and severe self-occlusions.
We also test our method on the INRIA GAMMA 3D Mesh Database [GAM, 2017], which is a large collection of human-created 3D models. Our method, trained on our MCL benchmark dataset, is applied to the INRIA GAMMA database. Figure 11 presents labeling results produced by our method on a few sample models. More results can be found in the supplementary material.
Comparison with baseline (random forest)
To verify the effectiveness of our part hypothesis based analysis and multi-scale CNN based labeling, we implement a baseline using a conventional method, i.e., hand-crafted features plus random forest classification. Specifically, we extract features, including the light field descriptor [Chen et al., 2010], the spherical harmonic descriptor [Kazhdan et al., 2003], the volume ratio and the bounding box diameter, for each component, and feed them into a random forest classifier for component classification. We used the default parameter settings of the standard MATLAB toolbox for random forest, with the number of trees being . The comparison is shown in Table 1 (row 6). Our mid-level, part hypothesis analysis (the last row) significantly outperforms this alternative.
Comparison with baseline (CNN-based classification)
The second baseline we compare to is a direct CNN-based component classification, without part hypothesis based analysis. Taking labeled components as training samples, we learn a network with the same architecture as in Figure 9, except that only the classification loss, , in Equation (1) is employed. The performance is reported in Table 1 (row 7). Our hypothesis-level analysis (the last row) achieves much higher labeling accuracy than component-level analysis, because part hypotheses capture richer semantic information than individual components.
Comparison with baseline (CNN-based hypothesis generation)
To demonstrate the difficulty of generating part hypotheses from fine-grained components with drastically varying numbers and sizes, we implement a CNN-based hypothesis generation by extending Fast R-CNN [Girshick, 2015] to a 3D volumetric representation. The network architecture and its detailed explanation can be found in the supplementary material. Taking the volumetric representation of a shape as input, the network is trained to predict at each voxel a 3D box representing a part hypothesis. This is followed by another network for joint classification and refinement of the hypothesis regions. The training data are the ground-truth parts in our MCL dataset after voxelization. The results, shown in Table 1 (row 8), are inferior to those of our method. The main reason is that the significant scale variation of components makes it difficult for a volumetric representation to characterize their shape and structure. This justifies our design choice of hierarchical search for part hypothesis generation.
Comparison to state-of-the-art methods
We compare our approach with the methods in [Yi et al., 2017a] and [Guo et al., 2015], both of which adopt multiple traditional features as inputs to train neural networks. For the shapes in our dataset, we compute both face-level and component-level geometric features, based on the original implementations of the two works. Details on the features can be found in the two original papers. Note, however, that [Yi et al., 2017a] is able to produce hierarchical labeling, whereas our method is not designed for this goal. To make the two methods comparable, we compare our labels only to those of the leaf nodes produced by [Yi et al., 2017a].
Our method is also compared with PointNet [Su et al., 2017] and PointNet++ [Qi et al., 2017], two state-of-the-art deep-learning-based methods for semantic labeling of point clouds. We apply these methods by sampling the surface of the test shapes while keeping the semantic labels, resulting in about points for each shape. To ensure good performance of the two methods on our dataset and a fair comparison, we used their models pre-trained on ShapeNet and fine-tuned them on our training dataset.
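The conversion from labeled shapes to labeled point clouds can be sketched as follows. The sampling scheme is not specified in the text, so this sketch assumes area-weighted uniform sampling over triangles, with every sampled point inheriting the label of the triangle it was drawn from. The function names and the data layout (triangles as triples of 3D points) are assumptions made for illustration.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle from half the cross-product norm."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_labeled_points(triangles, labels, n_points, seed=0):
    """Draw n_points surface samples; each keeps its triangle's label.

    triangles: list of (a, b, c) vertex triples; labels: one per triangle.
    Returns a list of (point, label) pairs.
    """
    rng = random.Random(seed)
    areas = [triangle_area(*tri) for tri in triangles]
    # Larger triangles receive proportionally more samples.
    chosen = rng.choices(range(len(triangles)), weights=areas, k=n_points)
    samples = []
    for idx in chosen:
        a, b, c = triangles[idx]
        # Uniform barycentric coordinates via the square-root trick.
        s = math.sqrt(rng.random())
        r = rng.random()
        u, v, w = 1.0 - s, s * (1.0 - r), s * r
        point = tuple(u * a[i] + v * b[i] + w * c[i] for i in range(3))
        samples.append((point, labels[idx]))
    return samples
```

Because the label is attached at sampling time, per-point predictions can later be mapped back to components, for example by majority vote; that mapping is likewise an assumption rather than a documented step.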
We report the per-category IoU percentages of these four methods on our benchmark dataset in Table 1. The results demonstrate the significant advantage of our part hypothesis analysis approach, with consistently more accurate labeling. In particular, our method significantly outperforms [Yi et al., 2017a] on all categories except ‘office’, where the two are comparable. The significance is high (p-value ) for models with severe self-occlusions such as vehicles, cabinets, motors, etc., and moderately high (p-value ) for the categories ‘bicycle’ and ‘lamp’. Another notable observation is that all the alternative methods, especially PointNet and PointNet++, have a hard time dealing with scene models. Scenes typically have more complicated structures due to the loose spatial coupling between objects. Our method, on the other hand, is able to handle structures in various scales and forms, ranging from individual objects to compound scenes.
4.3. Parameter analyses and ablation studies
Parameter
When performing component inference and labeling (Section 3.3), the number of top-ranked part hypotheses, denoted by , selected for each component in defining the unary potential (Equation (4)) is an important parameter of our method. We experiment with this parameter set to and respectively, while keeping all other parameters unchanged. means to use all part hypotheses of (i.e., ). The per-category average IoU results are shown in rows 16 to 20 of Table 1. For object categories, the best performance is obtained when using for each component. For scene categories (the last two columns), however, leads to better performance. This is because, for scene categories, the top-ranked hypotheses, corresponding to the early groupings that emerge in the hierarchical sampling process, are usually the individual objects in the scene. Such groups occur more frequently and are hence more reliable to capture. The subsequent groupings, however, imply larger-scale, inter-object structures. Since the spatial relationships between objects are usually loose, as pointed out earlier, such structures are less reliable (hard to learn), especially when the grouping scale becomes very large.
Labeling performance over part hypothesis count
We also evaluate our method with a varying number of generated part hypotheses. Figure 12 plots avg IoU over the number of part hypotheses. The test is performed on six object categories of our benchmark dataset; results on more categories can be found in the supplementary material. The same goes for all the plots shown in this paper. Generally, the performance grows as the number of hypotheses increases, but stops growing at a specific number. For all categories, we choose the hypothesis count no greater than , even for structurally complicated categories such as vehicle and bicycle. This shows that our approach is insensitive to the initial number of part hypotheses.
Labeling performance without confidence score
For each part hypothesis, our network regresses a confidence score that measures how likely the hypothesis is to represent an independent semantic part. This score is employed in defining the unary potential in the global labeling inference. To test its effect, we experiment with an ablated version of our method that ignores this confidence score (by setting in Equation (4)), while keeping all other parameters unchanged. The experimental results are reported in row 13 of Table 1. For all categories, our method works better when incorporating the confidence score. In particular, the improvement of average IoU over ‘w/o score’ ranges from to .
4.4. Evaluation on part hypothesis generation
Part hypothesis quality vs. hypothesis count
In Figure 13, we study how part hypothesis quality changes with the number of sampled hypotheses. Part hypothesis quality is measured as follows: for the sampled hypotheses whose IoU is greater than , we compute their recall rate over the ground-truth semantic parts. We find that the hypothesis quality (recall rate) grows rapidly as the number of hypotheses increases and quickly stabilizes at a moderate hypothesis count. For complex categories (e.g., bicycle, vehicle, motor, office), the count is smaller than . For other categories (e.g., cabinets, chair, lamp, living room), on the other hand, the number is no greater than . These numbers show that our sampling algorithm produces high-quality hypotheses with a moderate sampling size, much smaller than that of exhaustive enumeration.
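The recall measure described above can be sketched as follows. The sketch assumes that hypotheses and ground-truth parts are both represented as sets of component indices, that IoU is computed on those sets (a volumetric or bounding-box IoU would also be plausible), and that a ground-truth part counts as recalled when at least one hypothesis exceeds the threshold. The threshold is left as a parameter, and all names are illustrative.

```python
def set_iou(a, b):
    """IoU of two collections of component indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recall_at_iou(hypotheses, gt_parts, threshold):
    """Fraction of ground-truth parts matched by some hypothesis
    whose IoU with the part is greater than the threshold."""
    if not gt_parts:
        raise ValueError("at least one ground-truth part is required")
    recalled = sum(
        1
        for part in gt_parts
        if any(set_iou(h, part) > threshold for h in hypotheses)
    )
    return recalled / len(gt_parts)


def recall_curve(ranked_hypotheses, gt_parts, threshold, counts):
    """Recall when only the first k hypotheses are kept, for each k in counts."""
    return [
        recall_at_iou(ranked_hypotheses[:k], gt_parts, threshold)
        for k in counts
    ]
```

Evaluating `recall_curve` on hypotheses ordered by sampling rank gives one curve per category in the style of Figure 13, where the plateau indicates the hypothesis count beyond which extra samples no longer cover new parts.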
Comparison to alternatives
We assess the quality of part hypotheses by comparing our hierarchical grouping algorithm with two alternative methods. The first is the CNN-based hypothesis generation mentioned above. The second learns, for each shape category, a Gaussian Mixture Model (GMM) of the position and scale distribution of the bounding boxes of semantic parts. Given an input shape, this method generates part hypotheses by sampling the GMM of the corresponding shape category. For the comparison, we generate the same number of top hypotheses for all methods. While our method takes the top hypotheses according to the hierarchical sampling order (Section 3.1), the GMM samples based on probability. For the CNN-based method, we use all hypotheses generated. Figure 14 plots the curves of recall rate over average IoU for the three methods. It can be observed that our method produces the highest-quality part hypotheses. The GMM-based method is a probabilistic sampling approach, which is fuzzy and cannot produce hypotheses with accurate boundaries.
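The GMM alternative can be illustrated with a small sampling sketch. It assumes an already fitted mixture with diagonal covariances over six-dimensional box parameters (three for the center, three for the extent); the parameter layout, the diagonal-covariance simplification and the function name are assumptions, and fitting the mixture is not shown.

```python
import random


def sample_box_hypotheses(weights, means, stds, n_samples, seed=0):
    """Sample axis-aligned box hypotheses from a diagonal-covariance GMM.

    weights: mixture weights, one per Gaussian.
    means, stds: per-Gaussian 6-vectors (cx, cy, cz, sx, sy, sz).
    Returns a list of (center, size) tuples.
    """
    rng = random.Random(seed)
    # Pick one Gaussian per sample according to the mixture weights.
    picks = rng.choices(range(len(weights)), weights=weights, k=n_samples)
    boxes = []
    for k in picks:
        values = [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]
        center = tuple(values[:3])
        # Clamp extents so that every sampled box has positive size.
        size = tuple(max(v, 1e-6) for v in values[3:])
        boxes.append((center, size))
    return boxes
```

Boxes drawn this way depend only on category-level statistics and not on the geometry of the input shape, which is consistent with the observation that their boundaries tend to be inaccurate.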
4.5. Network analysis
To evaluate our network design, we study the effect of each of the three towers (local, global, and contextual) of our multi-scale CNNs. Specifically, we train and test networks with two different combinations, ‘local only’ and ‘local+global’. Table 1 (rows 14 and 15) shows the results when these combinations of towers are employed. Our multi-scale CNN architecture (with all three towers) gives the best performance for object categories, while the ‘local+global’ architecture leads to higher performance for scene categories. This can be explained, again, by the fact that indoor scenes, even from the same category, have loosely defined structure and exhibit much layout variation. This makes it difficult to learn their global structure reliably, even with the help of the contextual tower, when the training data are not sufficiently large. Since our method is not targeted at scene parsing, we leave this for future work. Nevertheless, our method still obtains acceptable accuracy for scenes, verifying its ability to handle structures across a large range of scales.
4.6. Training and testing time
Our implementation is built on top of the Caffe [Jia et al., 2014] framework with standard settings. We used Adam [Kingma and Ba, 2014] stochastic optimization for training with a mini-batch of size . The initial learning rate is . The numbers of training samples are listed in Table 1 (row 5). Training takes about minutes per iteration, and about hours per shape category. Testing our CNN consumes about seconds per part hypothesis. The whole task takes about seconds per shape. Table 2 shows the timing for the various algorithmic components. Runtime computations were performed on a machine with an Nvidia GTX 1080 GPU and a 4-core Intel i7-5820K CPU.
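Table 2. Timing of the algorithmic components for objects and scenes.
Technical component    Objects  Scenes
---------------------  -------  ------
Hypothesis generation  5.2      5.6
CNN testing            10.5     4.5
Higher-order CRF       7.4      3.6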
4.7. An application to shape correspondence
We apply our approach to component-level shape correspondence, which has been studied in [Alhashim et al., 2015] and [Zhu et al., 2017]. Given two multi-component shapes, we first use our method to group and label the components of each shape. Based on the semantic labeling, we find a global alignment for the local canonical frames of the two shapes. This is achieved by minimizing the spatial distance between every two components with the same label, one from each shape. Given a pair of semantic parts with the same label, one from each of the two shapes being matched, we align their bounding boxes and find correspondences for their enclosed components. In particular, we find for each component of one shape the spatially closest counterpart in the other shape. After the bidirectional search and a conflict-resolving post-processing step (always keeping the closer one if there are multiple matches), we return for each component of one shape a unique matching component in the other shape.
We compare our simple method with two state-of-the-art methods, [Alhashim et al., 2015] and [Zhu et al., 2017], on the benchmark dataset GeoTopo [Alhashim et al., 2015]. Table 3 reports the precision and recall for component matching. Our method achieves results comparable to theirs, and performs better on categories with a higher number of semantic parts, such as airplane and velocipedes. Note that our method does not consider any high-level structural information such as symmetry.
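The component matching step inside one pair of aligned semantic parts can be sketched as follows. The sketch assumes that each component is summarized by its centroid after bounding-box alignment and that distance is Euclidean; neither choice is stated in the text. Conflicts are resolved greedily by accepting candidate pairs in order of increasing distance, which is one way of always keeping the closer match. The function name is hypothetical.

```python
import math


def match_components(centroids_a, centroids_b):
    """One-to-one nearest-neighbor matching between two aligned parts.

    centroids_a, centroids_b: lists of 3D points, one per component.
    Returns a dict mapping indices in A to indices in B.
    """
    if not centroids_a or not centroids_b:
        return {}

    def dist(i, j):
        return math.dist(centroids_a[i], centroids_b[j])

    # Bidirectional search: closest B for every A, closest A for every B.
    candidates = set()
    for i in range(len(centroids_a)):
        j = min(range(len(centroids_b)), key=lambda jj: dist(i, jj))
        candidates.add((i, j))
    for j in range(len(centroids_b)):
        i = min(range(len(centroids_a)), key=lambda ii: dist(ii, j))
        candidates.add((i, j))

    # Conflict resolution: closer pairs win, each component is used once.
    matches, used_b = {}, set()
    for i, j in sorted(candidates, key=lambda pair: dist(*pair)):
        if i not in matches and j not in used_b:
            matches[i] = j
            used_b.add(j)
    return matches


part_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
part_b = [(0.1, 0.0, 0.0), (1.2, 0.0, 0.0)]
print(match_components(part_a, part_b))
```

In this sketch a component may remain unmatched when the two parts contain different numbers of components, as happens for the third component of `part_a` above.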
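Table 3. Precision (P) and recall (R) of component matching on the GeoTopo benchmark.
Category     GeoTopo       DDS           Ours
             P      R      P      R      P      R
-----------  -----  -----  -----  -----  -----  -----
Chair        0.69   0.67   0.83   0.83   0.71   0.75
Table        0.63   0.61   0.81   0.86   0.79   0.80
Bed          0.60   0.62   0.78   0.81   0.75   0.72
Airplane     0.60   0.68   0.80   0.85   0.85   0.91
Velocipedes  0.47   0.44   0.43   0.49   0.48   0.55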
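The precision and recall reported for component matching can be computed as in the following sketch. It assumes that both the predicted and the ground-truth correspondences are given as sets of (component in shape A, component in shape B) index pairs; the exact protocol of the GeoTopo benchmark may differ, and the function name is illustrative.

```python
def precision_recall(predicted, ground_truth):
    """Precision and recall of predicted component correspondences.

    predicted, ground_truth: iterables of (index_a, index_b) pairs.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    correct = len(predicted & ground_truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For example, if two of three predicted pairs are among four ground-truth pairs, precision is 2/3 and recall is 1/2.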
5. Conclusion
We have studied a new problem of labeled segmentation of off-the-shelf 3D models based on their pre-existing, highly fine-grained components. We approach the problem with a novel solution of part hypothesis analysis. The core idea of our approach is to exploit part hypotheses as a mid-level representation for semantic composition analysis of 3D shapes. This leads to a highly robust labeling algorithm which can handle highly complicated structures in various scales and forms. Our work contributes, to the best of our knowledge, the first component-wise labeling algorithm that works for both single objects and compound scenes.
The success of our method is due to three key features. First, part hypotheses are generated in a principled way, based on a bottom-up hierarchical grouping process guided by three intuitive criteria. Second, a deep neural network is trained to encode part hypotheses, rather than components, accounting for both local geometric and global contextual information. Third, the higher-order potential in our CRF-based formulation adopts a soft consistency constraint, providing more degrees of freedom in the search for an optimal labeling.
Limitations, failure cases and future work
Our approach has several limitations, which point out directions for future investigation. First, our current solution only groups the components and does not segment them further. It thus cannot handle the case where the components are under-segmented with respect to semantic parts. Figure 15(a) shows two examples of this failure case. For such examples, a correct labeling cannot be obtained without further breaking up the component. According to our statistics, only about shapes in ShapeNet have this issue, based on our own set of semantic labels. As future work, we would consider incorporating component-level segmentation into our framework. Figure 15(b) shows another type of failure case. When the bounding boxes of two part hypotheses overlap significantly, due to, for example, shape concavity, their labeling can be misleading. Currently, our method does not produce hierarchical part grouping and labeling, as in [Yi et al., 2017a]. It would be interesting to investigate extending our hypothesis analysis to the task of hierarchical segmentation. For example, determining the order of grouping, or the structure of the hierarchy, is a non-trivial task. Another worthy topic for future research is the integration of the CRF into the deep neural network to make the entire model end-to-end trainable while avoiding reliance on strong supervision [Kalogerakis et al., 2017].
Acknowledgements
We thank the anonymous reviewers for their valuable comments. The authors are grateful to Hao Su for fruitful discussions, and to Zheyuan Cai and Yahao Shi for their help with data preparation. This work was supported in part by NSFC (61572507, 61532003, 61622212, 61502023 and U1736217). Kai Xu is also supported by a visiting research scholarship offered by the China Scholarship Council (CSC) and Princeton University.
Footnotes
 ccs: Computing methodologies Shape analysis
 journal: TOG
 journalyear: 2018
 journalvolume: 37
 journalnumber: 6
 article: 1
 publicationmonth: 11
 doi: 10.1145/3272127.3275009
References
 2017. 3D Warehouse. https://3dwarehouse.sketchup.com/. (2017). Accessed: 2017-05-18.
 2017. GAMMA mesh database. https://www.rocq.inria.fr/gamma/gamma/download/download.php. (2017). Accessed: 2017-09-10.
 Ibraheem Alhashim, Kai Xu, Yixin Zhuang, Junjie Cao, Patricio Simari, and Hao Zhang. 2015. Deformation-driven topology-varying 3D shape correspondence. ACM Transactions on Graphics 34, 6 (2015), 236.
 Oscar Kin-Chung Au, Youyi Zheng, Menglin Chen, Pengfei Xu, and Chiew-Lan Tai. 2012. Mesh Segmentation with Concavity-Aware Fields. IEEE Transactions on Visualization and Computer Graphics (TVCG) 18, 7 (2012), 1125–1134.
 João Carreira and Cristian Sminchisescu. 2012. CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34, 7 (2012), 1312–1328.
 Angel X. Chang, Thomas Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. 2015. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR]. Stanford University, Princeton University, Toyota Technological Institute at Chicago.
 Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. 2010. On Visual Similarity Based 3D Model Retrieval. Computer Graphics Forum (Proc. Eurographics) 22, 3 (2010), 223–232.
 Ross Girshick. 2015. Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015).
 Kan Guo, Dongqing Zou, and Xiaowu Chen. 2015. 3D Mesh Labeling via Deep Convolutional Neural Networks. ACM Transactions on Graphics 35, 1 (2015), 3:1–3:12.
 Ruizhen Hu, Lubin Fan, and Ligang Liu. 2012. Co-Segmentation of 3D Shapes via Subspace Clustering. Computer Graphics Forum (Proc. SGP) 31, 5 (2012), 1703–1713.
 Qixing Huang, Vladlen Koltun, and Leonidas J. Guibas. 2011. Joint Shape Segmentation with Linear Programming. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 30, 6 (2011), 125:1–125:12.
 Qixing Huang, Martin Wicke, Bart Adams, and Leonidas J. Guibas. 2009. Shape Decomposition using Modal Analysis. Computer Graphics Forum (Proc. Eurographics) 28, 2 (2009), 407–416.
 Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. (2014), 675–678.
 Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, and Siddhartha Chaudhuri. 2017. 3D shape segmentation with projective convolutional networks. In Proc. CVPR, Vol. 1. 8.
 Evangelos Kalogerakis, Aaron Hertzmann, and Karan Singh. 2010. Learning 3D Mesh Segmentation and Labeling. ACM Transactions on Graphics (Proc. SIGGRAPH) 29, 4 (2010), 102:1–102:12.
 Sagi Katz and Ayellet Tal. 2003. Hierarchical Mesh Decomposition using Fuzzy Clustering and Cuts. ACM Transactions on Graphics (Proc. SIGGRAPH) 22, 3 (2003), 954–961.
 Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. 2003. Rotation Invariant Spherical Harmonic Representation of 3D Shape Descriptors. In Proc. SGP. 156–164.
 Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. Computer Science (2014).
 Pushmeet Kohli, L’ubor Ladický, and Phillip H. S. Torr. 2008. Graph Cuts for Minimizing Robust Higher Order Potentials. Technical Report. Oxford Brookes University, UK.
 Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, and Leonidas Guibas. 2017. GRASS: Generative Recursive Autoencoders for Shape Structures. ACM Transactions on Graphics (Proc. of SIGGRAPH) 36, 4 (2017), 52.
 Tianqiang Liu, Siddhartha Chaudhuri, Vladimir G. Kim, Qi-Xing Huang, Niloy J. Mitra, and Thomas Funkhouser. 2014. Creating Consistent Scene Graphs Using a Probabilistic Grammar. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 33, 6 (2014), 211:1–211:12.
 Jiajun Lv, Xinlei Chen, Jin Huang, and Hujun Bao. 2012. Semi-supervised Mesh Segmentation and Labeling. Computer Graphics Forum (Proc. Pacific Graphics) 31, 7 (2012), 2241–2248.
 Santiago Manen, Matthieu Guillaumin, and Luc Van Gool. 2014. Prime Object Proposals with Randomized Prim’s Algorithm. In Proc. IEEE International Conference on Computer Vision (ICCV). 2536–2543.
 Kyoungup Park and Stephen Gould. 2003. On Learning Higher-Order Consistency Potentials for Multi-class Pixel Labeling. In Proc. European Conference on Computer Vision (ECCV). 202–215.
 Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arXiv preprint arXiv:1706.02413 (2017).
 Lior Shapira, Shy Shalom, Ariel Shamir, Daniel Cohen-Or, and Hao Zhang. 2010. Contextual Part Analogies in 3D Objects. International Journal of Computer Vision (IJCV) 89, 12 (2010), 309–326.
 Oana Sidi, Oliver van Kaick, Yanir Kleiman, Hao Zhang, and Daniel Cohen-Or. 2011. Unsupervised Co-Segmentation of a Set of Shapes via Descriptor-Space Spectral Clustering. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 30, 6 (2011), 126:1–126:10.
 Saurabh Singh, Abhinav Gupta, and Alexei A. Efros. 2012. Unsupervised Discovery of Mid-level Discriminative Patches. In European Conference on Computer Vision. arXiv:cs.CV/1205.3137 http://arxiv.org/abs/1205.3137
 Hao Su, Charles Ruizhongtai Qi, Kaichun Mo, and Leonidas J. Guibas. 2017. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). to appear.
 Minhyuk Sung, Hao Su, Vladimir G. Kim, Siddhartha Chaudhuri, and Leonidas Guibas. 2017. ComplementMe: Weakly-Supervised Component Suggestions for 3D Modeling. ACM Transactions on Graphics (Proc. of SIGGRAPH Asia) 36, 6 (2017).
 Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, and Arnold W. M. Smeulders. 2011. Segmentation as Selective Search for Object Recognition. In Proc. IEEE International Conference on Computer Vision (ICCV). 1879–1886.
 Oliver van Kaick, Kai Xu, Hao Zhang, Yanzhen Wang, Shuyang Sun, Ariel Shamir, and Daniel Cohen-Or. 2013. Co-Hierarchical Analysis of Shape Structures. ACM Transactions on Graphics (Proc. SIGGRAPH) 32, 4 (2013), 69:1–69:10.
 Yunhai Wang, Shmulik Asafi, Oliver van Kaick, Hao Zhang, Daniel Cohen-Or, and Baoquan Chen. 2012. Active Co-Analysis of a Set of Shapes. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 31, 6 (2012), 165:1–165:10.
 Yunhai Wang, Minglun Gong, Tianhua Wang, Daniel Cohen-Or, Hao Zhang, and Baoquan Chen. 2013. Projective Analysis for 3D Shape Segmentation. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 32, 6 (2013), 192:1–192:12.
 Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1912–1920.
 Zhige Xie, Kai Xu, Ligang Liu, and Yueshan Xiong. 2014. 3D Shape Segmentation and Labeling via Extreme Learning Machine. Computer Graphics Forum (Proc. SGP) 33, 5 (2014), 85–95.
 Kai Xu, Vladimir G Kim, Qixing Huang, Niloy Mitra, and Evangelos Kalogerakis. 2016. Data-driven shape analysis and processing. In SIGGRAPH ASIA 2016 Courses. ACM, 4.
 Kai Xu, Honghua Li, Hao Zhang, Daniel Cohen-Or, Yueshan Xiong, and Zhi-Quan Cheng. 2010. Style-content separation by anisotropic part scales. ACM Transactions on Graphics (TOG) 29, 6 (2010), 184.
 Li Yi, Leonidas Guibas, Aaron Hertzmann, Vladimir G. Kim, Hao Su, and Ersin Yumer. 2017a. Learning Hierarchical Shape Segmentation and Labeling from Online Repositories. SIGGRAPH (2017).
 Li Yi, Hao Su, Xingwen Guo, and Leonidas J. Guibas. 2017b. SyncSpecCNN: Synchronized Spectral CNN for 3D Shape Segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). to appear.
 Juyong Zhang, Jianmin Zheng, Chunlin Wu, and Jianfei Cai. 2012. Variational Mesh Decomposition. ACM Transactions on Graphics 31, 3 (2012), 21:1–21:15.
 Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1529–1537.
 Chenyang Zhu, Renjiao Yi, Wallace Lira, Ibraheem Alhashim, Kai Xu, and Hao Zhang. 2017. Deformation-driven shape correspondence via shape recognition. ACM Transactions on Graphics 36, 4 (2017), 51.