# Learning Hierarchical Shape Segmentation and Labeling from Online Repositories

## Abstract

We propose a method for converting geometric shapes into hierarchically segmented parts with part labels. Our key idea is to train category-specific models from the scene graphs and part names that accompany 3D shapes in public repositories. These freely-available annotations represent an enormous, untapped source of information on geometry. However, because the models and corresponding scene graphs are created by a wide range of modelers with different levels of expertise, modeling tools, and objectives, these models have very inconsistent segmentations and hierarchies with sparse and noisy textual tags. Our method involves two analysis steps. First, we perform a joint optimization to simultaneously cluster and label parts in the database while also inferring a canonical tag dictionary and part hierarchy. We then use this labeled data to train a method for hierarchical segmentation and labeling of new 3D shapes. We demonstrate that our method can mine complex information, detecting hierarchies in man-made objects and their constituent parts, obtaining finer scale details than existing alternatives. We also show that, by performing domain transfer using a few supervised examples, our technique outperforms fully-supervised techniques that require hundreds of manually-labeled models.

## 1 Introduction

Segmentation and labeling of 3D shapes is an important problem in geometry processing. These structural annotations are critical for many applications, such as animation, geometric modeling, manufacturing, and search [24]. Recent methods have shown that, by supervised training from labeled shape databases, state-of-the-art performance can be achieved on mesh segmentation and part labeling [19]. However, such methods rely on carefully-annotated databases of shape segmentations, whose creation is extremely labor-intensive. Moreover, these methods have used coarse segmentations into just a few parts each, and do not capture the fine-grained, hierarchical structure of many real-world objects. Capturing fine-scale part structure is very difficult with non-expert manual annotation; it is difficult even to determine the set of parts and labels to separate. Another option is to use unsupervised methods that work without annotations by analyzing geometric patterns [33]. Unfortunately, these methods do not have access to the full semantics of shapes and as a result often do not identify parts that are meaningful to humans, nor can they apply language labels to models or their parts. Additionally, typical co-analysis techniques do not easily scale to large datasets.

We observe that, when creating 3D shapes, artists often provide a considerable amount of extra structure with the model. In particular, they separate parts into hierarchies represented as *scene graphs*, and annotate individual parts with textual names. In surveying the online geometry repositories, we find that most shapes are provided with these kinds of user annotations. Furthermore, there are often thousands of models per category available to train from. Hence, we ask: can we exploit this abundant and freely-available metadata to analyze and annotate new geometry?

Using these user-provided annotations comes with many challenges. For instance, Figure (a) shows four typical scene graphs in the car category, created by four different authors. Each one has a different part hierarchy and set of parts, e.g., only two of the scene graphs have the steering wheel of the car as a separate node. The hierarchies have different depths; some are nearly flat and some are more complex. Only a few parts are given names in each model. Despite this variability, inspecting these models reveals common trends, such as certain parts that are frequently segmented, parts that are frequently given consistent names, and pairs of parts that frequently occur in parent-child relationships with each other. For example, the tire is often a separate part, it is usually the child of the wheel, and it usually has a name like `tire` or `RightTire`. Our goal is to exploit these trends, while being robust to the many variations in names and hierarchies that different model creators use.

This paper proposes to learn shape analysis from these messy, user-created datasets, thus leveraging the freely-available annotations provided by modelers. Our main goal is to automatically discover common trends in part segmentation, labeling, and hierarchy. Once learned, our method can be applied to new shapes that consist of geometry alone: the new shape is automatically segmented into parts, which are labeled and placed in a hierarchy. Our method can also be used to clean up the existing databases. Our method is designed to work with large training sets, learning from thousands of models in a category. Because the annotations are uncurated, sparse (within each shape), and irregular, this problem is an instance of weakly-supervised learning.

Our approach handles each shape category (e.g., cars, airplanes, etc.) in a dataset separately. For a given shape category, we first identify the commonly-occurring part names within that class, and manually condense this set, combining synonyms, and removing uninformative names. We then perform an optimization that simultaneously (a) learns a metric for classifying parts, (b) assigns names to unnamed parts where possible, (c) clusters other unnamed parts, (d) learns a canonical hierarchy for parts in the class, and (e) provides a consistent labeling to all parts in the database. Given this annotation of the training data, we then learn to hierarchically segment new models, using a Markov Random Field (MRF) segmentation algorithm. Our algorithms are designed to scale to training on large datasets by mini-batch processing. We use these outputs to train a hierarchical segmentation model. Then, given a new, unsegmented mesh, we can apply this learned model to segment the mesh, transfer the tags, and infer the part hierarchy.

We use our method to analyze shapes from ShapeNet [5], a large-scale dataset of 3D models and part graphs obtained from online repositories. We demonstrate that our method can mine complex information, detecting hierarchies in man-made objects and their constituent parts, and obtaining finer scale details than existing alternatives. While our problem is different from what has been explored in previous research, we perform two types of quantitative evaluations. First, we evaluate different variants of our method by holding some tags out, and show that all terms in our objective function are important to obtain the final result. Second, we show that supervised learning techniques require hundreds of manually labeled models until they reach the quality of segmentation that we get without any explicit supervision. We publicly share our code and the processed datasets in order to encourage further research.

## 2 Related Work

Recent shape analysis techniques focus on extracting structure from large collections of 3D models [37]. In this section we discuss recent work on detecting labeled parts and hierarchies in shape collections.

**Shape Segmentation and Labeling.** Given a sufficient number of training examples, it is possible to learn to segment and label novel geometries [19]. While supervised techniques achieve impressive accuracy, they require dense training data for each new shape category, which significantly limits their applicability. To decrease the cost of data collection, researchers have developed methods that rely on crowdsourcing [8], active learning [35], or both [38]. However, this only decreases the cost of data collection, but does not eliminate it. Moreover, these methods have not demonstrated the ability to identify fine-grained model structure, or hierarchies. One can rely solely on consistency in part geometry to extract meaningful segments without supervision [11]. However, since these methods do not take any human input into account, they typically only detect coarse parts, and do not discover semantically salient regions where geometric cues fail to encapsulate the necessary discriminative information.

In contrast, we use the part graphs that accompany 3D models to weakly supervise the shape segmentation and labeling. This is similar in spirit to existing unsupervised approaches, but it mines semantic guidance from ambient data that accompanies most available 3D models.

Our method is an instance of weakly-supervised learning from data on the web. A number of related problems have been explored in computer vision, including learning classifiers and captions from user-provided images on the web, e.g., [16], or image searches, e.g., [6].

**Shape Hierarchies.** Previous work attempted to infer scene graphs based on symmetry [34] or geometric matching [33]. However, as with unsupervised segmentation techniques, these methods only succeed in the presence of strong geometric cues. To address this limitation, Liu et al. proposed a method that learns a probabilistic grammar from examples, and then uses it to create consistent scene graphs for unlabeled input. However, their method requires accurately labeled example scene graphs. Fisher et al. [9] use scene graphs from online repositories, focusing on arrangements of objects in scenes, whereas we focus on fine-scale analysis of individual shapes.

In contrast, we leverage the scene graphs that exist for most shapes created by humans. Even though these scene graphs are noisy and contain few meaningful node names (Figure ?(a)), we show that it is possible to learn a consistent hierarchy by combining cues from corresponding sparse labels and similar geometric entities in a joint framework. Such label correspondences not only help our clusters be semantically meaningful, but also help us discover additional common nodes in the hierarchy.

## 3 Overview

Our goal is to learn an algorithm that, given a shape from a specific class (e.g., cars or airplanes), can segment the shape, label the parts, and place the parts into a hierarchy. Our approach is to train based on geometry downloaded from online model repositories. Each shape is composed of 3D geometry segmented into distinct parts; each part has an optional textual name, and the parts are placed in a hierarchy. The hierarchy for a single model is called a scene graph. As discussed above, different training models may be segmented in different hierarchies; our goal is to learn from trends in the data as to which parts are often segmented, how they are typically labeled, and which parts are typically children of other parts.

We break the analysis into two sub-tasks:

**Part-Based Analysis** (Section 4). Given a set of meshes in a specific category and their original messy scene graphs, we identify the dictionary of distinct parts for a category, and place them into a canonical hierarchy. This dictionary includes both parts with user-provided names (e.g., `wheel`) and a clustering of unnamed parts. All parts on the training meshes are labeled according to the part dictionary.

**Hierarchical Mesh Segmentation** (Section 5). We train a method to segment a new mesh into a hierarchical segmentation, using the labels and hierarchy provided by the previous step. For parts with textual names, these labels are also transferred to the new parts.

We evaluate with testing on hold-out data, and qualitative evaluation. In addition, we show how to adapt our model to a benchmark dataset.

Our method makes two additional assumptions. First, our feature vector representations assume consistently-oriented meshes, following the representation in ShapeNetCore [5]. Second, the canonical hierarchy requires that every type of part has only one possible parent label, e.g., our algorithm might infer that the parent of a `headlight` is always the `body`, if this is frequently the case in the training data.

In our segmentation algorithm, we usually assume that each connected component in the mesh belongs to a single part. This can be viewed as a form of over-segmentation assumption (e.g., [33]), and we found it to be generally true for our input data, e.g., see Figure ?(b) and Figure 1. We show results both with and without this assumption in Section ? and in the Supplemental Material.

## 4 Part-Based Analysis

The first step of our process takes the shapes in one category as input, and identifies a dictionary of parts for that category, a canonical hierarchy for the parts, and a labeling of the training meshes according to this part dictionary. Each input shape is represented by a scene graph: a rooted directed tree whose nodes are parts with geometric features and whose edges each indicate that one part is a child of another. We manually pre-process the user-provided part names into a tag dictionary, which is a list of part names relevant for the input category (Table ?). One could imagine discovering these names automatically. We opted for manual processing, since the vocabulary of words that appear in ShapeNet part labels is fairly limited, and there are many irregularities in the label usage, e.g., synonyms and misspellings. The parts with a label from the dictionary are then assigned the corresponding tags. Note that many parts are untagged, either because no names were provided with the model, or because the user-provided names did not map onto names in the dictionary. Note also that part indices within a shape are independent of tags; e.g., there is no necessary relation between parts that share the same index on two different shapes. Each graph has a root node, which has a special root tag, and no parent. For non-leaf nodes, the geometry of any node is the union of geometries of its children.
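The input representation described above can be pictured with a minimal sketch. The class and field names (`Part`, `name`, `tag`, `children`) are illustrative assumptions, not the data format used by ShapeNet or by the authors' code; geometry and features are omitted.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Part:
    """One node of a scene graph (all field names are illustrative)."""
    name: str                      # raw user-provided name, possibly meaningless
    tag: Optional[str] = None      # entry of the tag dictionary, if one matched
    children: List["Part"] = field(default_factory=list)


def parent_child_pairs(node: Part) -> Iterator[Tuple[Part, Part]]:
    """Yield every (parent, child) edge of the tree rooted at `node`."""
    for child in node.children:
        yield node, child
        yield from parent_child_pairs(child)


def tagged_fraction(root: Part) -> float:
    """Fraction of non-root parts that carry a dictionary tag."""
    parts = [child for _, child in parent_child_pairs(root)]
    return sum(p.tag is not None for p in parts) / len(parts)


# A toy car: only two of the four non-root parts carry a usable tag.
tire = Part("RightTire", tag="tire")
wheel = Part("wheel_fr", tag="wheel", children=[tire])
car = Part("model", tag="root", children=[wheel, Part("Mesh_017"), Part("obj3")])

edges = [(p.name, c.name) for p, c in parent_child_pairs(car)]
print(edges)
print(tagged_fraction(car))
```

The parent-child edges enumerated here are the pairs from which parent-child label statistics would be gathered; the untagged parts are the ones that need clustering.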

To produce a dictionary of parts, we could directly use the user-provided tags, and then cluster the untagged parts. However, this naive approach would have several intertwined problems. First, the user-provided tags may be incorrect in various ways: missing tags for known parts (e.g., a wheel not tagged at all), tags given only at a high level of the hierarchy (e.g., the rim and the tire are not segmented from the wheel, and they are all tagged as wheel), and tags that are simply wrong. The clustering itself depends on a distance metric, which must be learned from labels. We would like to have tags be applied as broadly and accurately as possible, to provide as much clean training data as possible for labeling and clustering, and to correctly transfer tags when possible. Finally, we would also like to use parent-child relationships to constrain the part labeling (so that a wheel is not the child of a door), but plausible parent-child relationships are not known a priori.

We address these problems by jointly optimizing for all unknowns: the distance metric, a dictionary of parts, a labeling of parts according to this dictionary, and a probability distribution over parent-child relationships. The labeling of model parts is also done probabilistically, by the Expectation-Maximization (EM) algorithm [25], where the hidden variables are the part labels. The distance metric is encoded in an embedding function, which maps a part represented by a shape descriptor (Appendix A) to a lower-dimensional feature space. The embedding function is represented as a neural network (Figure ?). Each canonical part has a representative cluster center in the feature space, so that a new part can be classified by nearest-neighbors distance in the feature space. Note that the clusters do not have an explicit association with tags: our energy function only encourages parts with the same tag to fall in the same cluster. As a post-process, we match tag names to clusters where possible.

We model parent-child relationships with a matrix whose entries give, for a part in a given cluster, the probability that its parent has a given label. After the learning stage, this matrix is converted to a deterministic canonical hierarchy over all of the parts.

Our method is inspired in part by the semi-supervised clustering method of Basu et al. [1]. In contrast to their linear embedding of initial features for metric learning, we incorporate a neural network embedding procedure to allow non-linear embedding in the presence of constraints, and use an EM soft clustering. In addition, Basu et al. do not take hierarchical representations into consideration, whereas our data is inherently a hierarchical part tree.

### 4.1 Objective function

The EM objective function is:

where are the parameters of the embedding , are the label probabilities such that represents the probability of the part of shape be assigned to label cluster, and are the unknown cluster centers. We set throughout all experiments.

The first two terms, and , encourage the concentration of clusters in the embedding space; encourages the separation of visually dissimilar parts in embedding space; is introduced to estimate the parent-child relationship matrix ; the entropy term is a consequence of the derivation of the EM objective (Appendix ?) and is required for correct estimation of probabilities. We next describe the energy terms one by one.

Our first term favors part embeddings to be near their corresponding cluster centroids:

where is the embedding function , represented as a neural network and parametrized by a vector . The network is described in Appendix A.
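As a rough illustration of the concentration term, the sketch below assumes it takes the usual soft-clustering form: each part contributes its squared distance to every cluster center, weighted by the part's probability of belonging to that cluster. The exact form and weighting in the paper's equation are not recoverable from the text here, so this is an assumption; the embeddings are given directly rather than produced by a network.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def concentration_energy(embeddings, probs, centers):
    """Sum over parts and clusters of p[part][k] * ||embedding - center_k||^2."""
    return sum(
        p_k * sq_dist(z, mu)
        for z, p in zip(embeddings, probs)
        for p_k, mu in zip(p, centers)
    )


# Two clusters, two embedded parts (toy numbers).
centers = [(0.0, 0.0), (2.0, 0.0)]
embeddings = [(0.0, 1.0), (2.0, 0.0)]
probs = [(1.0, 0.0), (0.5, 0.5)]   # soft label probabilities per part

energy = concentration_energy(embeddings, probs, centers)
print(energy)
```

The first part sits at distance 1 from its only cluster, and the second splits its mass between a center it coincides with and one at squared distance 4, so the terms are 1 and 2.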

Second, our objective function constrains the embedding, by favoring small distances for parts that share the same input tag, and for parts that have very similar geometry:

We extract all tagged parts and sample pairs from them for the constraint. We set to a small constant to account for near-perfect repetitions of parts, and ensure that these parts are assigned to the same cluster.

Third, our objective favors separation in the embedded space by a margin between parts on the same shape that are not expected to have the same label:

We only use parts from the same shape in , since we believe it is generally reasonable to assume that parts on the same shape with distinct tags or distinct geometry have distinct labels.
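The two pairwise terms can be sketched as a pull penalty for pairs expected to share a label and a margin-based push penalty for pairs expected to differ. The squared-distance and squared-hinge forms below are a common choice and are assumptions here, as is the `margin` value; the paper's own equations may differ in detail.

```python
import math


def pull_term(za, zb):
    """Penalty for a pair expected to share a label: squared distance."""
    return sum((x - y) ** 2 for x, y in zip(za, zb))


def push_term(za, zb, margin):
    """Hinge penalty for a pair expected to differ.

    It vanishes once the two embeddings are at least `margin` apart.
    """
    dist = math.dist(za, zb)
    return max(0.0, margin - dist) ** 2


# Two tires with the same tag should be close; a tire and a door on the
# same shape should be separated by at least the (hypothetical) margin.
print(pull_term((0.0, 0.0), (3.0, 4.0)))
print(push_term((0.0, 0.0), (0.5, 0.0), margin=1.0))
print(push_term((0.0, 0.0), (3.0, 4.0), margin=1.0))
```

A hinge keeps already-separated pairs from dominating the gradient, which is why the last call contributes nothing.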

Finally, we score the labels of parent-child pairs by how well they match the overall parent-child label statistics in the data, using the negative log-likelihood of a multinomial:

### 4.2 Generalized EM algorithm

We optimize the objective function (Equation 1) by alternating between E and M steps. We solve for the soft labeling in the E-step, and for the other parameters (the embedding parameters, the cluster centers, and the parent-child probabilities) in the M-step.

**E-step.** Holding the model parameters fixed, we optimize for the label probabilities :

We optimize this via coordinate descent, by iterating times over all coordinates. The update is given in Appendix C.

**M-step.** Next, we hold the soft clustering fixed and optimize the model parameters by solving the following subproblem:

We use stochastic gradient descent updates for and , as is standard for neural networks, while keeping fixed. The parent-child probabilities are then computed as: where is a column-wise normalization function to guarantee . and are the cluster probability vectors that correspond to parts and of the same shape, respectively. in our experiments, to prevent cluster centers from stalling at zero. Since each column of is a separate multinomial distribution, the update in Eq. is the standard multinomial estimator.
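The parent-child update described above amounts to accumulating soft co-occurrence counts over parent-child edges, adding a small constant, and normalizing each column. The sketch below assumes the layout `M[l][k]` = probability that the parent has label `l` given that the child is in cluster `k`, so that each column is one multinomial; the smoothing value `eps` is a hypothetical placeholder, since the value used in the experiments is not legible in the text.

```python
def update_parent_child_matrix(edges, num_labels, eps=1e-6):
    """Estimate M from soft labels of parent-child pairs.

    edges: list of (p_child, p_parent) probability vectors, one pair per
           parent-child edge in the training scene graphs.
    Returns M with M[l][k] ~ P(parent label = l | child label = k).
    """
    K = num_labels
    counts = [[eps] * K for _ in range(K)]          # eps is a hypothetical value
    for p_child, p_parent in edges:
        for l in range(K):
            for k in range(K):
                counts[l][k] += p_parent[l] * p_child[k]
    # Column-wise normalization: each column sums to one.
    col_sums = [sum(counts[l][k] for l in range(K)) for k in range(K)]
    return [[counts[l][k] / col_sums[k] for k in range(K)] for l in range(K)]


# Labels: 0 = body, 1 = wheel, 2 = tire (hard assignments for clarity).
edges = [
    ([0.0, 1.0, 0.0], [1.0, 0.0, 0.0]),   # a wheel whose parent is the body
    ([0.0, 0.0, 1.0], [0.0, 1.0, 0.0]),   # a tire whose parent is a wheel
]
M = update_parent_child_matrix(edges, num_labels=3)
print(M[0][1], M[1][2])
```

Columns with no evidence (here, the column for `body` as a child) fall back to a uniform distribution because only the smoothing constant contributes.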

**Mini-batch training.** The dataset for any category is far too large to fit in memory, and so, in practice, we break the learning process into mini-batches. Each mini-batch includes 50 geometric models at a time. For the set , 20,000 random pairs of parts are sampled across models in the mini-batch. 30 epochs (passes over the whole dataset) are used.

For each mini-batch, the E-step is computed as above. In the mini-batch M-step, the embedding parameters and cluster centers are updated by standard stochastic gradient descent (SGD) updates, using Adam updates [21]. For the hierarchy , we use Stochastic EM updates [4], which are more stable and efficient than gradient updates. The sufficient statistics are computed for the minibatch:

Running averages for the sufficient statistics are updated after each mini-batch: where in our experiments. Then, the estimates for are computed from the current sufficient statistics by:
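A minimal sketch of the running-average step, assuming the standard stochastic EM form in which the stored statistics are blended with the current mini-batch statistics by a step size. The name `eta` and its value are hypothetical; the step size used in the experiments is not legible in the text.

```python
def update_running_stats(running, batch, eta):
    """Blend mini-batch sufficient statistics into the running average."""
    return [
        [(1.0 - eta) * r + eta * b for r, b in zip(r_row, b_row)]
        for r_row, b_row in zip(running, batch)
    ]


def normalize_columns(stats):
    """Turn accumulated statistics into column-stochastic probabilities."""
    n_cols = len(stats[0])
    col_sums = [sum(row[k] for row in stats) for k in range(n_cols)]
    return [[row[k] / col_sums[k] for k in range(n_cols)] for row in stats]


running = [[4.0, 1.0], [0.0, 3.0]]   # statistics accumulated so far
batch = [[0.0, 1.0], [4.0, 1.0]]     # statistics of the current mini-batch
running = update_running_stats(running, batch, eta=0.5)
estimate = normalize_columns(running)
print(running)
print(estimate)
```

Because the estimate is always recomputed from the averaged statistics, a single unusual mini-batch shifts the hierarchy probabilities only partially.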

**Initialization.** Our objective, like many EM algorithms, requires good initialization. We first initialize the neural network embedding with normalized initialization [10]. For each named tag, we specify an initial cluster center as the average of the embeddings of all the parts with that tag. The remaining cluster centroids are randomly sampled from a normal distribution in the embedding space. The cluster label probabilities are initialized by a nearest-neighbor hard-clustering, and then the parent-child matrix is initialized by Eq. ?.

### 4.3 Outputs

Once the optimization is complete, we compute a canonical hierarchy from the parent-child probabilities by solving a Directed Minimum Spanning Tree problem, with the root constrained to the entire object. Then, we assign tags to parts in the hierarchy by solving a linear assignment problem that maximizes the number of input tags in each cluster that agree with the tag assigned to their cluster. As a result, some parts in the canonical hierarchy receive textual names from assigned tags. Unmatched clusters are denoted with generic names `cluster_N`. We then label each input part with its most likely node in the canonical hierarchy by selecting the label with the highest probability. This gives a part labeling of each node in each input scene graph. An example of the canonical hierarchy with part names, and a labeled shape, is shown in Figure ?.
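To make the hierarchy extraction concrete, the sketch below picks one parent per label so that the result is a tree rooted at the whole object and the total log-probability is maximal. It enumerates all parent assignments, which is only feasible for a handful of labels; it is a brute-force stand-in for a proper directed spanning tree solver, and the use of negative log-probabilities as edge costs is an assumption. The matrix layout `M[parent][child]` is also an assumption.

```python
import math
from itertools import product


def canonical_hierarchy(M, root=0):
    """Return {child: parent} maximizing the sum of log M[parent][child].

    Brute force over all parent choices; every used entry of M must be > 0.
    """
    n = len(M)
    children = [k for k in range(n) if k != root]
    best_score, best_parents = -math.inf, None
    for choice in product(range(n), repeat=len(children)):
        parents = dict(zip(children, choice))
        if any(c == p for c, p in parents.items()):
            continue                      # a label cannot be its own parent
        if not all(_reaches_root(c, parents, root) for c in children):
            continue                      # reject assignments with cycles
        score = sum(math.log(M[p][c]) for c, p in parents.items())
        if score > best_score:
            best_score, best_parents = score, parents
    return best_parents


def _reaches_root(node, parents, root):
    """Follow parent links and report whether they end at the root."""
    seen = set()
    while node != root:
        if node in seen:
            return False
        seen.add(node)
        node = parents[node]
    return True


# Labels: 0 = root (whole car), 1 = body, 2 = wheel, 3 = tire.
M = [
    # child: root  body  wheel tire
    [0.25, 0.90, 0.30, 0.05],   # parent = root
    [0.25, 0.00, 0.60, 0.05],   # parent = body
    [0.25, 0.05, 0.00, 0.90],   # parent = wheel
    [0.25, 0.05, 0.10, 0.00],   # parent = tire
]
hierarchy = canonical_hierarchy(M)
print(hierarchy)
```

For realistic label counts one would replace the enumeration with the Chu-Liu/Edmonds algorithm, which solves the same problem in polynomial time.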

This canonical hierarchy, part dictionary, and part labels for the input scene graphs are then used to train the segmentation algorithm as described in the next section.

## 5 Hierarchical Mesh Segmentation

Given the part dictionary, canonical hierarchy, and per-part labels from the previous section, we next learn to hierarchically segment and label new shapes. We formulate the problem as labeling each mesh face with one of the leaf labels from the canonical hierarchy. Because each part label has only one possible parent, all of a leaf node’s ancestors are unambiguous. In other words, once the leaf nodes are specified, it is straightforward to completely convert the shape into a scene graph, with all the nodes in the graph labeled. In order to demonstrate our approach in full generality, we assume the input shape includes only geometry, and no scene graph or part annotations. However, it should be possible to augment our procedure when such information is available.

### 5.1 Unary classifier

We begin by describing a technique for training a classifier for individual faces. This classifier can also be used to classify connected components. In the next section, we build an MRF labeler from this. Our approach is based on the method of Kalogerakis et al., but generalized to handle missing leaf labels and connected components, and to use neural network classifiers.

The face classifier is formulated as a neural network that takes geometric features of a face as input, and assigns scores to the leaf node labels for the face. The feature vector for a face consists of several standard geometric features. The neural network specifies a score function , where is a weight vector for label , and is a sequence of fully-connected layers and non-linear activation units, applied to . The score function is normalized by a softmax function to produce an output probability: where is the set of possible leaf node labels. See Appendix B for details of the feature vector and neural network.
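The final stage of the face classifier, a per-label linear score followed by softmax normalization, can be sketched as follows. The hidden feature vector stands in for the output of the fully-connected layers; the names `label_scores` and `softmax` and all numbers are illustrative.

```python
import math


def label_scores(weights, hidden):
    """One score per leaf label: dot product of its weight vector with the features."""
    return [sum(w * h for w, h in zip(w_l, hidden)) for w_l in weights]


def softmax(scores):
    """Normalize scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [1.0, 2.0]                                 # stand-in for the network output
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]      # three leaf labels
probs = softmax(label_scores(weights, hidden))
print(probs)
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores.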

To train this classifier, we can apply the per-part labels from the previous section to the individual faces. However, there is one problem with doing so: many training meshes are not segmented to the finest possible detail. For example, a car wheel might not be segmented into tire and rim, or the windows may not be segmented from the body. In this case, the leaf node labels are not given for each face, but only ancestor nodes are known: we do not know which wheel faces are tire faces. In order to handle this, we introduce a probability table whose entries give the probability of a face taking a given leaf label if the deepest label given for this training face is a given (possibly non-leaf) label. For example, one entry is the probability that the correct leaf label for a face labeled as a wheel is tire. To estimate the table, we first compute unnormalized entries by counting the number of faces assigned to both labels, except that an entry is zero if the deepest given label is not an ancestor of the leaf label in the canonical hierarchy. The table is then obtained by normalizing the columns to sum to 1.
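One way to read the counting procedure is sketched below: faces with a known leaf label contribute to the column of every label on their path to the root, and each column is then normalized. The input format (a face count per leaf label) and the treatment of a leaf as its own deepest label are assumptions made for illustration.

```python
def ancestors_or_self(label, parent):
    """The label itself followed by all of its ancestors up to the root."""
    chain = [label]
    while label in parent:
        label = parent[label]
        chain.append(label)
    return chain


def estimate_leaf_table(leaf_face_counts, parent):
    """Return table[given][leaf] = P(leaf label | deepest given label)."""
    counts = {}
    for leaf, n_faces in leaf_face_counts.items():
        for given in ancestors_or_self(leaf, parent):
            counts.setdefault(given, {})[leaf] = n_faces
    table = {}
    for given, column in counts.items():
        total = sum(column.values())
        table[given] = {leaf: n / total for leaf, n in column.items()}
    return table


parent = {"tire": "wheel", "rim": "wheel", "wheel": "car", "body": "car"}
leaf_face_counts = {"tire": 300, "rim": 100, "body": 600}   # toy counts
table = estimate_leaf_table(leaf_face_counts, parent)
print(table["wheel"])
```

Leaves that are not descendants of the given label never appear in its column, which plays the role of the zero entries in the table.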

We then train the classifier by minimizing the following loss function for and , the parameters of : where sums over all faces in the training shapes and is the deepest label assigned to face as discussed above. This loss is the negative log-likelihood of the training data, marginalizing over the hidden true leaf label for each training face, generalizing [16]. We use Stochastic Gradient Descent to minimize this objective.
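The loss described above marginalizes over the unknown leaf label of each training face. A minimal sketch, assuming the classifier outputs are given as dictionaries of leaf probabilities and the table is stored as `table[given][leaf]` (both are illustrative conventions, not the authors' data layout):

```python
import math


def marginal_nll(face_probs, deepest_labels, table):
    """Negative log-likelihood, summing out the hidden leaf label per face."""
    total = 0.0
    for probs, given in zip(face_probs, deepest_labels):
        likelihood = sum(a * probs[leaf] for leaf, a in table[given].items())
        total -= math.log(likelihood)
    return total


table = {
    "wheel": {"tire": 0.75, "rim": 0.25},   # face only known to be a wheel
    "tire": {"tire": 1.0},                  # face labeled at the leaf level
}
face_probs = [
    {"tire": 0.8, "rim": 0.1, "body": 0.1},
    {"tire": 0.5, "rim": 0.25, "body": 0.25},
]
deepest_labels = ["tire", "wheel"]

loss = marginal_nll(face_probs, deepest_labels, table)
print(loss)
```

For a face labeled at the leaf level the sum has a single term, so the loss reduces to ordinary cross-entropy; for a coarsely labeled face, any split of probability among the plausible leaves is rewarded.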

We have also observed that meshes in online repositories are comprised of connected components, and these connected components almost always have the same label for the entire component. For most results presented in this paper, we use connected components as the basic labeling units instead of faces, in order to improve results and speed. We define the connected component classifier by aggregating the trained face classifier over all the faces of the connected component as follows:

### 5.2 MRF labeler

Let be the set of leaf nodes of the canonical hierarchy. In the case of classifying each connected component, we want to specify one leaf node for each connected component . We define the MRF over connected component labels as:

where the weight is set by cross-validation separately for each shape category and held constant across all experiments. The unary term assesses the likelihood of a component having a given leaf label, based on geometric features of the component, and is given by the connected component classifier. The edge term prefers adjacent components to have the same label. It is defined in terms of the tree distance between the two labels in the canonical hierarchy, which encourages adjacent labels to be as close in the canonical hierarchy as possible. For example, the tree distance is 0 when the two labels are the same, whereas it is 2 if they are different but share a common parent. To generate the edge set, we connect each connected component to its nearest connected components, where the number of neighbors depends on the number of connected components in the mesh.
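The tree distance used by the edge term can be computed by walking both labels up to their lowest common ancestor. The sketch below stores the canonical hierarchy as a child-to-parent dictionary; the label names are illustrative.

```python
def path_to_root(label, parent):
    """The label followed by its ancestors, ending at the root."""
    path = [label]
    while label in parent:
        label = parent[label]
        path.append(label)
    return path


def tree_distance(a, b, parent):
    """Number of edges on the path between labels a and b in the hierarchy."""
    depth_from_a = {node: d for d, node in enumerate(path_to_root(a, parent))}
    for d, node in enumerate(path_to_root(b, parent)):
        if node in depth_from_a:          # first shared ancestor
            return depth_from_a[node] + d
    raise ValueError("labels are not in the same tree")


parent = {
    "tire": "wheel", "rim": "wheel", "wheel": "car",
    "headlight": "body", "body": "car",
}
print(tree_distance("tire", "tire", parent))
print(tree_distance("tire", "rim", parent))
print(tree_distance("tire", "headlight", parent))
```

Sibling labels such as `tire` and `rim` are cheaper to place next to each other than labels from different subtrees, which matches the behavior described for the edge term.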

Once the classifiers are trained, the model can be applied to a new mesh as follows. First, the leaf labels are determined by optimizing Equation 9 using the α-β swap algorithm [3]. Then, the scene graph is computed by bottom-up grouping. In particular, adjacent components with the same leaf label are first grouped together. Then, adjacent groups with the same parent are grouped at the next level of the hierarchy, and so on.
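The bottom-up grouping can be sketched with a union-find over the component adjacency graph: merge adjacent components that carry the same label, replace every label by its parent, and repeat. This simplified version lifts all labels by one level at a time regardless of depth, which is an assumption; the function names and toy data are illustrative.

```python
def group_components(labels, adjacency):
    """Merge adjacent components that carry the same label; return the groups."""
    root = list(range(len(labels)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]       # path halving
            i = root[i]
        return i

    for a, b in adjacency:
        if labels[a] == labels[b]:
            root[find(a)] = find(b)
    groups = {}
    for i in range(len(labels)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def lift(labels, parent):
    """Replace each label by its parent; the root label maps to itself."""
    return [parent.get(label, label) for label in labels]


parent = {"tire": "wheel", "rim": "wheel", "wheel": "car", "body": "car"}
labels = ["tire", "rim", "tire", "body"]       # leaf label per component
adjacency = [(0, 1), (1, 3), (2, 3)]           # pairs of adjacent components

levels = []
for _ in range(3):
    levels.append(group_components(labels, adjacency))
    labels = lift(labels, parent)
print(levels)
```

Components 0 and 1 merge into one wheel at the second level, while component 2 stays a separate wheel because it is not adjacent to them; everything merges into the car at the top.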

For the case where connected components are not available, the MRF algorithm is applied for each face. The unary term is given by the face classifier . We still need to handle the case where the object is not topologically connected, and so the pairwise term applies to all faces and whose centroids fall into the -nearest neighborhood of each other, and is given by: where is the angle between the faces, is the distance between the face centroids, is the average distance between a face’s centroid and its nearest face’s centroid, and in all our experiments. is a scale factor to promote faces sharing an edge: (u,v)

## 6 Results

## 7 Discussions and Conclusion

We have proposed a novel method for mining consistent hierarchical shape models from massive but sparsely annotated scene graphs “in the wild.” As we analyze the input data, we jointly embed parts to a low-dimensional feature space, cluster corresponding parts, and build a probabilistic model for hierarchical relationships among them. We demonstrated that our model can facilitate hierarchical mesh segmentation and were able to extract complex hierarchies and identify small segments in 3D models from various shape categories. Our method can also provide a valuable boost for supervised segmentation algorithms. The goal of our current framework is to extract as much structure as possible from raw noisy, sparsely tagged scene graphs that exist in online repositories. In the future, we believe that using such freely-available information will provide enormous opportunities for shape analysis.

Developing Convolutional Neural Networks for surfaces is a very active area right now, e.g., [12]. Our segmentation training loss functions are largely agnostic to the model representation, and it ought to be straightforward to train a ConvNet on our training loss, for any ConvNet that handles disconnected components.

Though our method is effective, as evidenced by experimental evaluations, several issues are not yet fully addressed. Our model currently relies on heuristic selection of the number of clusters, and this could be chosen automatically. We could also relax the assumption that each part with a given label may have only one possible parent label, to allow more general shape grammars [30].

Our method has obtained about K 3D training models with roughly consistent segmentation, but these have not been human-verified. We also believe that our approach could be leveraged together with crowdsourcing techniques [38] to efficiently yield very large, detailed, segmented, and verified shape databases.

It would also be interesting to explore how well the information learned from one object category may transfer to other object categories. For example, “wheel” can be found in “cars” and “motorbikes”, sharing similar geometry and sub-structures. This observation provides the opportunity to transfer not only the part embeddings but also the part relationships. With the growth of online model repositories, such transfer learning ability would become more important for expanding our current dataset efficiently.

## A Part Features and Embedding Network

We compute per-part geometric features, which are further used for joint part embedding and clustering (Section 4). The feature vector includes a 3-view lightfield descriptor (LFD) [7] (with HOG features for each view), center-of-mass (CoM), bounding box diameter, approximate surface area (fraction of voxels occupied in a 30×30×30 object grid), and local frame in the PCA coordinate system (represented by a matrix). To mitigate reflection ambiguities for the local frame, we constrain all frame axes to have a positive dot product with a fixed axis (typically up) of the global frame. For the lightfield descriptor we normalize the part to be centered at the origin and have bounding box diameter 1; for all other descriptors we normalize the mesh in the same way. The neural network embedding is visualized in Figure ?, and, in Table ?, we show the embedding network parameters, where we alter the first few fully connected layers to allocate more neurons for richer features such as LFD.

| feature | fc1 | fc2 | fc3 | concat | fc4 | fc5 | fc6 |
|---|---|---|---|---|---|---|---|
| LFD | 128 | 256 | 256 | 512 | 256 | 128 | 64 |
| PCA Frame | 16 | 32 | 64 | | | | |
| CoM | 16 | 64 | 64 | | | | |
| Diameter | 8 | 32 | 64 | | | | |
| Area | 8 | 32 | 64 | | | | |

## B Face Features and Classifier Network

We compute per-face geometric features, which are further used for hierarchical mesh segmentation (Section 5). These features include spin images (SI) [18], shape context (SC) [2], distance distribution (DD) [27], local PCA (LPCA) (features computed from the eigenvalues of the local coordinate system), local point position variance (LVar), curvature, point position (PP) and normal (PN). To compute the local radius for the feature computation, we sample 10000 points on the entire shape and use 50 nearest neighbors. We use the same architecture as the part embedding network (Fig. ?) for face classification, but with a different loss function (Eq. ?) and network parameters, which are summarized in Table ?.

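| Feature | fc1 | fc2 | fc3 | concat | fc4 | fc5 | fc6 |
|---|---|---|---|---|---|---|---|
| Curvature | 32 | 64 | 64 | 640 | 256 | 128 | 128 |
| LPCA | 64 | 64 | 64 | | | | |
| LVar | 32 | 64 | 64 | | | | |
| SI | 128 | 128 | 128 | | | | |
| SC | 128 | 128 | 128 | | | | |
| DD | 32 | 64 | 64 | | | | |
| PP | 16 | 32 | 64 | | | | |
| PN | 16 | 32 | 64 | | | | |

The concat, fc4, fc5, and fc6 layers operate on the concatenation of all feature branches, so their sizes are listed once, in the first row.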

## C E-step Update

In the E-step, the assignment probabilities are updated iteratively. The probability that a given node is assigned to a given label is updated using the current assignment probabilities of the node's children and of its parent node. A joint closed-form update to all assignments could be computed using belief propagation, but we did not try this.
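The general shape of such an iterative update can be sketched as a coordinate-wise, mean-field style pass over the tree, where each node's label distribution is recomputed from its own score and the expected compatibility with its children and parent. The code below is a generic illustration and should not be read as the paper's update equation: `unary`, `log_compat`, and the function name `e_step` are all assumptions introduced here.

```python
import numpy as np

def e_step(unary, parent, log_compat, n_iters=10):
    """
    Generic mean-field style update on a tree (assumed form, not the paper's formula).

    unary:      (n_nodes, n_labels) log-scores of each node for each label
    parent:     parent index per node, -1 for the root
    log_compat: log_compat[a, b] is the log-compatibility of parent label a with child label b
    """
    unary = np.asarray(unary, dtype=float)
    n_nodes, n_labels = unary.shape
    children = [[] for _ in range(n_nodes)]
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)

    q = np.full((n_nodes, n_labels), 1.0 / n_labels)   # start from uniform assignments
    for _ in range(n_iters):
        for i in range(n_nodes):
            score = unary[i].copy()
            for c in children[i]:
                score += log_compat @ q[c]             # expectation over the child's label
            if parent[i] >= 0:
                score += q[parent[i]] @ log_compat     # expectation over the parent's label
            score -= score.max()                       # stabilize the softmax
            q[i] = np.exp(score) / np.exp(score).sum()
    return q
```

Each node is updated in turn while its neighbors are held fixed, which is why several sweeps are needed; belief propagation would instead compute all marginals jointly in two passes over the tree.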

### References

[1] Basu, S., Bilenko, M., and Mooney, R. J. 2004. A probabilistic framework for semi-supervised clustering.

[2] Belongie, S., Malik, J., and Puzicha, J. 2002. Shape matching and object recognition using shape contexts.

[3] Boykov, Y., Veksler, O., and Zabih, R. 2001. Fast approximate energy minimization via graph cuts.

[4] Cappé, O., and Moulines, E. 2009. On-line expectation-maximization algorithm for latent data models.

[5] Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., and Yu, F. 2015. ShapeNet: An Information-Rich 3D Model Repository. arXiv:1512.03012.

[6] Chen, X., and Gupta, A. 2015. Webly supervised learning of convolutional networks.

[7] Chen, D.-Y., Tian, X.-P., Shen, Y.-T., and Ouhyoung, M. 2003. On visual similarity based 3d model retrieval.

[8] Chen, X., Golovinskiy, A., and Funkhouser, T. 2009. A benchmark for 3d mesh segmentation.

[9] Fisher, M., Savva, M., and Hanrahan, P. 2011. Characterizing structural relationships in scenes using graph kernels.

[10] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks.

[11] Golovinskiy, A., and Funkhouser, T. 2009. Consistent segmentation of 3D models.

[12] Guo, K., Zou, D., and Chen, X. 2015. 3D mesh labeling via deep convolutional neural networks.

[13] Hu, R., Fan, L., and Liu, L. 2012. Co-segmentation of 3d shapes via subspace clustering.

[14] Huang, Q., Koltun, V., and Guibas, L. 2011. Joint shape segmentation with linear programming.

[15] Huang, Q., Wang, F., and Guibas, L. 2014. Functional map networks for analyzing and exploring large shape collections.

[16] Izadinia, H., Russell, B. C., Farhadi, A., Hoffman, M. D., and Hertzmann, A. 2015. Deep classifiers from image tags in the wild.

[17] Jelinek, F., Lafferty, J. D., and Mercer, R. L. 1992. Basic methods of probabilistic context free grammars.

[18] Johnson, A. E., and Hebert, M. 1999. Using spin images for efficient object recognition in cluttered 3d scenes.

[19] Kalogerakis, E., Hertzmann, A., and Singh, K. 2010. Learning 3d mesh segmentation and labeling.

[20] Kim, V. G., Li, W., Mitra, N. J., Chaudhuri, S., DiVerdi, S., and Funkhouser, T. 2013. Learning part-based templates from large collections of 3d shapes.

[21] Kingma, D. P., and Ba, J. L. 2015. Adam: A method for stochastic optimization.

[22] Li, X., Uricchio, T., Ballan, L., Bertini, M., Snoek, C. G. M., and Bimbo, A. D. 2016. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval.

[23] Liu, T., Chaudhuri, S., Kim, V. G., Huang, Q.-X., Mitra, N. J., and Funkhouser, T. 2014. Creating Consistent Scene Graphs Using a Probabilistic Grammar.

[24] Mitra, N. J., Wand, M., Zhang, H., Cohen-Or, D., and Bokeloh, M. 2013. Structure-aware shape processing.

[25] Neal, R. M., and Hinton, G. E. 1998. A view of the em algorithm that justifies incremental, sparse, and other variants.

[26] Ordonez, V., Kulkarni, G., and Berg, T. L. 2011. Im2text: Describing images using 1 million captioned photographs.

[27] Osada, R., Funkhouser, T., Chazelle, B., and Dobkin, D. 2002. Shape distributions.

[28] Porter, M. F. 1980. An algorithm for suffix stripping.

[29] Sidi, O., van Kaick, O., Kleiman, Y., Zhang, H., and Cohen-Or, D. 2011. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering.

[30] Talton, J., Yang, L., Kumar, R., Lim, M., Goodman, N., and Měch, R. 2012. Learning design patterns with bayesian grammar induction.

[31] Tighe, J., and Lazebnik, S. 2011. Understanding scenes on many levels.

[32] Torresani, L. 2016. Weakly-supervised learning.

[33] van Kaick, O., Xu, K., Zhang, H., Wang, Y., Sun, S., Shamir, A., and Cohen-Or, D. 2013. Co-hierarchical analysis of shape structures.

[34] Wang, Y., Xu, K., Li, J., Zhang, H., Shamir, A., Liu, L., Cheng, Z., and Xiong, Y. 2011. Symmetry Hierarchy of Man-Made Objects.

[35] Wang, Y., Asafi, S., van Kaick, O., Zhang, H., Cohen-Or, D., and Chen, B. 2012. Active co-analysis of a set of shapes.

[36] Xie, Z., Xu, K., Liu, L., and Xiong, Y. 2014. 3d shape segmentation and labeling via extreme learning machine.

[37] Xu, K., Kim, V. G., Huang, Q., Mitra, N. J., and Kalogerakis, E. 2016. Data-driven shape analysis and processing.

[38] Yi, L., Kim, V. G., Ceylan, D., Shen, I., Yan, M., Su, H., Lu, C., Huang, Q., Sheffer, A., and Guibas, L. 2016. A scalable active framework for region annotation in 3d shape collections.

[39] Yumer, M. E., Chun, W., and Makadia, A. 2014. Co-segmentation of textured 3d shapes with sparse annotations.

[40] Zhou, Q., and Jacobson, A. 2016. Thingi10k: A dataset of 10,000 3d-printing models. arXiv:1605.04797.