Learning Material-Aware Local Descriptors for 3D Shapes
Material understanding is critical for design, geometric modeling, and analysis of functional objects. We enable material-aware 3D shape analysis by employing a projective convolutional neural network architecture to learn material-aware descriptors from view-based representations of 3D points for point-wise material classification or material-aware retrieval. Unfortunately, only a small fraction of shapes in 3D repositories are labeled with physical materials, posing a challenge for learning methods. To address this challenge, we crowdsource a dataset of 3D shapes with part-wise material labels. We focus on furniture models, which exhibit interesting structure and material variability. We also contribute a high-quality expert-labeled benchmark of shapes from Herman Miller and IKEA for evaluation. We further apply a mesh-aware conditional random field, which incorporates rotational and reflective symmetries, to smooth our local material predictions across neighboring surface patches. We demonstrate the effectiveness of our learned descriptors for automatic texturing, material-aware retrieval, and physical simulation. The dataset and code will be publicly available.
1 Introduction
Materials and geometry are two essential attributes of objects that define their function and appearance. The shape of a metal chair is quite different from that of a wooden one for reasons of robustness, ergonomics, and manufacturability. While recent work has studied the analysis and synthesis of 3D shapes [32, 48], no prior work directly addresses the inference of physical (as opposed to appearance-driven) materials from geometry.
Jointly reasoning about materials and geometry can enable important applications. Many large online repositories of 3D shapes have been developed, but these lack tags that associate object parts with physical materials, which hampers natural queries based on materials, e.g., predicting which materials are commonly used for object parts, retrieving objects composed of similar materials, and simulating how objects behave under real-world physics. Robotic perception often needs to reason about materials: a glass tumbler must be manipulated more gently than a steel one, and a hedge is a softer emergency collision zone for a self-driving car than a brick wall. In these settings, the color channel may be unreliable (e.g., at night), and the primary input is geometric depth data from LiDAR, time-of-flight, or structured light scanners. Interactive design tools can suggest a feasible assignment of materials for fabricating a modeled shape or indicate when a choice of materials would be unsuitable.
A key challenge in these applications is to reason about plausible material assignments from geometry alone, since color textures are often (in model repositories) or always (in night-vision robotics or interactive design) absent or unreliable. Further, material choices are guided by functional, aesthetic, and manufacturing considerations. This suggests that material assignments cannot be inferred simply from physical simulations, but require real-world knowledge.
In this paper, we address these challenges with a novel method to compute material-aware descriptors of 3D points directly from geometry. First, we crowdsource per-part material labels for 3D shapes. Second, we train a projective convolutional network to learn an embedding of geometric patches to a material-aware descriptor space. Third, we curate a benchmark of shapes with expert-labeled material annotations on which our material descriptors are evaluated.
Learning surface point descriptors for 3D shape data has been explored in previous approaches for 3D shape segmentation and correspondences [50, 18]. However, there are challenges with such an approach for our task. First, 3D shape datasets are limited in size. While the largest image dataset with material labels comprises 437K images, there is no shape dataset with material labels. Second, many 3D shapes in repositories have missing, non-photorealistic, or inaccurate textures (e.g., a single color for the whole shape). Material metadata is rarely entered by designers for 3D models. Therefore, it is difficult to automatically infer material labels. Third, gathering material annotations is a laborious task, especially for workers who do not have a strong association of untextured models with corresponding real-world objects. We address these challenges by designing a crowdsourcing task that enables effective collection of material annotations.
Our contributions are the following:
The first large database of 3D shapes with per-part physical material labels, comprising a smaller expert-labeled benchmark set and a larger crowdsourced training set, and a crowdsourcing strategy for material labeling of 3D shapes.
A new deep learning approach for extracting material-aware local descriptors of surface points of untextured 3D shapes, along with a symmetry-aware CRF to make material predictions more coherent.
Prototype material-aware applications that use our descriptors for automatic texturing, part retrieval, and physical simulation.
2 Previous work
We review work on material prediction for shapes and images, as well as deep neural networks for shapes.
Material prediction for shapes.
To make material assignment accessible to inexperienced users, Jain et al. proposed a method for automatically predicting reflective properties of objects. This method relies on a database of 3D shapes with known reflective properties, and requires data to be segmented into single-material parts. It uses light-field shape descriptors to represent part geometry and a graphical model to encode object or scene structure. Wang et al. proposed a method for transferring textures from object images to 3D models. This approach relies on aligning a single 3D model to a user-specified image and then uses inter-shape correspondence to assign the texture to more shapes. Chen et al. used color and texture statistics in images of indoor scenes to texture a 3D scene. They require the scene to be segmented into semantic parts and labeled. All these methods focus on visual attributes and appeal of the final renderings. In our work we focus on classifying physical materials, and do not assume any additional user input such as an image that matches a 3D shape, or a semantic segmentation. Savva et al. construct a large repository of 3D shapes with a variety of annotations, including category-level priors over material labels (e.g., “chairs are on average 42% fabric, 40% wood, 12% leather, 4% plastic and 2% metal”) obtained from annotated image datasets. The priors are not specific to individual shapes. Chen et al. gather natural language descriptions for 3D shapes that sometimes include material labels (“This is a brown wooden chair”), but there is no fine-grained region labeling that can be used for training. Yang et al. propose a data-driven algorithm to reshape shape components to a target fabrication material. We aim to produce component-independent material descriptors that can be used for a variety of tasks such as classification, and we consider materials beyond wood and metal.
Material prediction for images.
Photographs are a rich source of information about the appearance of objects. Image-based material acquisition and measurement has been an active area for decades; a comprehensive study of image-based measurement techniques can be found in . Material prediction “in the wild”, i.e., in uncontrolled non-laboratory settings, has recently gained more interest, fueled by the availability of datasets like the Flickr Materials Database [27, 38], Describable Textures Dataset, OpenSurfaces, and Materials in Context. Past techniques identified features such as gradient distributions along image edges, but recently deep learning has set new records for material recognition [13, 6]. In our work, we focus on renderings of untextured shapes rather than photographs of real-world scenes.
Deep neural networks for shape analysis.
A variety of neural network architectures have been proposed for both global (classification) and local (segmentation, correspondences) reasoning about 3D shapes. The variety of models is in large part due to the fact that unlike 2D images, which are almost always stored as raster grids of pixels, there is no single standard representation for 3D shapes. Hence, neural networks based on polygon meshes [31, 7], 2D renderings [41, 20, 18], local descriptors after spectral alignment, unordered point sets [45, 25, 34, 35, 40], canonicalized meshes, dense voxel grids [47, 14, 33], voxel octrees [36, 43], and collections of surface patches have been developed. Bronstein et al. provide an excellent survey of spectral, patch and graph-based approaches. Furthermore, methods such as  have been proposed for dense shape correspondences. Our goal is to learn features that reflect physical material composition, rather than representations that reflect geometric or semantic similarity. Our specific architecture derives from projective, or multi-view convolutional networks for local shape analysis [20, 18], which are good at identifying fine-resolution features (e.g., feature curves on shapes, hard/smooth edges) that are useful to discriminate between material classes. However, our approach is conceptually agnostic to the network used to process shapes, and other volumetric, spectral, or graph-based approaches could be used instead.
3 Data Collection
We collected a crowd-sourced training set of 3D shapes annotated with per-component material labels. We also created a benchmark of 3D shapes with verified material annotations to serve as ground-truth. Both datasets will be made publicly available.
3.1 3D Training Shapes
Our 3D training shapes originate from the ShapeNet v2 repository. We picked shapes from three categories with interesting material and structural variability: chairs, tables and cabinets. To crowd-source reliable material annotations for these shapes, we further pruned the original shapes as follows.
First, observing that workers are error-prone on texture-less shapes, we removed shapes that did not include any texture references. These account for of the original shapes. Second, to avoid relying on crowd workers for tedious manual material-based mesh segmentation, we only included shapes with pre-existing components (i.e., groups in their OBJ format). We also removed over-segmented meshes ( components), since these tended to have tiny parts that are too laborious to label. Meshes without any, or with too many components accounted for an additional of the original shapes. Third, to remove low-quality meshes that often resulted in rendering artifacts and further material ambiguity, we pruned shapes with fewer than triangles/vertices (another of the dataset). Finally, after removing duplicates, the remaining shapes were chairs, tables, and cabinets, summing to a total of shapes to be labeled.
To gather material annotations for the components of these shapes, we created questionnaires released through the Amazon Mechanical Turk (MTurk) service. Each questionnaire had queries (see supplementary for interface). Four different rendered views covering the front, sides and back of the textured 3D shape were shown. At the foot of the page, a single shape component was highlighted. Each query highlighted a different component. Workers were asked to select a label from a set of materials for the highlighted component. The set of materials was wood, plastic, metal, glass, and fabric (including leather). We selected this set to cover materials commonly found in furniture available in ShapeNet. We deliberately did not allow workers to select multiple materials to ensure they picked the most plausible material given the textured component rendering. We also provided a “null” option, with associated text “cannot tell / none of the above” so users could flag components whose material they found impossible to guess.
Our original list of materials also included stone, but workers chose this option only for a small fraction of components (0.5%), and thus we excluded it from training and testing. Preliminary versions of the questionnaire showed that users often had trouble telling apart metal from plastic components. We believe the reason was that metal and plastic components sometimes have similar texture and color. Thus, we provided an additional option “metal or plastic”. We note that our training procedure utilized this multi-material option, which is still partially informative as it excludes other materials.
Out of the queries in each questionnaire, of them were “sentinels” i.e., components whose correct material was clearly visible, unambiguous and confirmed by us. We used these sentinels to check worker reliability. Users who incorrectly labelled any sentinel or selected “null” for more than half of the questions were ignored. In total, workers participated in our data collection out of which were deemed as “unreliable”. All workers were compensated with for completing a questionnaire. On average, minutes were spent per questionnaire.
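The two rejection rules above can be sketched in a few lines. This is a minimal illustration rather than the code used for data collection: the `Questionnaire` container, the `"null"` string encoding, and the choice to count a "null" answer on a sentinel as incorrect are all assumptions made for the example.

```python
from dataclasses import dataclass

NULL_LABEL = "null"  # hypothetical encoding of "cannot tell / none of the above"


@dataclass
class Questionnaire:
    answers: dict    # component id -> label chosen by the worker
    sentinels: dict  # component id -> confirmed correct label (sentinel queries only)


def is_reliable(q: Questionnaire) -> bool:
    """Return False if either rejection rule from the text applies."""
    # Rule 1: any incorrectly labeled sentinel makes the worker unreliable.
    # (Assumption: a missing or "null" answer on a sentinel counts as incorrect.)
    for component, truth in q.sentinels.items():
        if q.answers.get(component) != truth:
            return False
    # Rule 2: "null" selected for more than half of the questions.
    n_null = sum(1 for label in q.answers.values() if label == NULL_LABEL)
    return n_null <= len(q.answers) / 2
```

A questionnaire that passes both checks contributes its answers as votes; one that fails either check is dropped entirely.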
Each component received answers (votes). The distribution of votes is shown in Figure 2(a). If or out of votes for a component agreed, we considered this a consensus vote. components received such consensus, of which components had consensus on the “null” option. Thus, components (out of i.e., ) acquired material labels. We further checked and included components with transparent textures and confirmed they were all glass. In total, we collected labeled components in shapes. The distribution of material labels is shown in Figure 2(b). For training, we kept only shapes with a majority of components labeled ( shapes).
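The consensus rule can be illustrated with a short sketch. The agreement threshold is passed in as the parameter `min_agree` because its value is not reproduced here; the function names and the `"null"` encoding are likewise hypothetical.

```python
from collections import Counter

NULL_LABEL = "null"  # hypothetical encoding of "cannot tell / none of the above"


def consensus_label(votes, min_agree):
    """Return the most voted label if at least `min_agree` votes agree, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


def material_label(votes, min_agree):
    """Consensus label for a component, discarding consensus on the null option."""
    label = consensus_label(votes, min_agree)
    return None if label == NULL_LABEL else label
```

With `min_agree` larger than half the number of votes, at most one label can reach consensus, so ties need no special handling.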
3.2 3D Benchmark Shapes
The 3D benchmark shapes originated from Herman Miller’s online catalog  and 3D Warehouse . All shapes were stored as meshes and chosen because they had explicit references to product names and descriptions from a corresponding manufacturer: IKEA  or Herman Miller. This dataset has chairs, tables and cabinets. Expert annotators assigned material labels to all shape components through direct visual reference to corresponding manufacturers’ product images as well as information from the textual product descriptions. Such annotation is not scalable, hence this dataset is relatively small and used purely for evaluation. See supplementary for distribution of labeled parts.
4 Network Architecture and Training
Our method trains a convolutional network that embeds surface points of 3D shapes in a high-dimensional descriptor space. To perform this embedding, our network learns “material-aware” descriptors for points through a multi-task optimization procedure.
4.1 Network architecture
To learn material-aware descriptors, we use the architecture visualized in Figure 1. The network follows a multi-view architecture [20, 18]. Other architectures could also be considered, e.g. volumetric [47, 50], spectral [31, 7, 8], or point-based [34, 40].
We follow Huang \etal’s  multi-view architecture. We render images around each surface point with a Phong shader and a single directional light along the viewing axis. The rendered images depict local surface neighborhoods around each point from distances of 0.25, 0.5 and 1.0 times the shape’s bounding sphere radius. The camera up vectors are aligned with the shape’s upright axis, as we assume shapes to be consistently upright-oriented. The viewpoints are selected to maximize surface coverage and avoid self-occlusion . In Huang \etal’s architecture , the images per point are processed through AlexNet branches  Because view-based representations for 3D shapes are somewhat similar to 2D images, we chose to use GoogLeNet  instead, which achieved strong results for 2D material recognition . Alternatives like VGG  yielded no notable differences. We tried rendering views as in Huang \etal’s work, but since ShapeNet shapes are upright-oriented, we found that upright-oriented views were equivalent.
In our GoogLeNet-based MVCNN, we aggregate the 1024D output from the 7x7 pooling layer after “inception 5b” for each of our views with a max view-pooling layer . This aggregated feature is reduced to a 512D descriptor. A subsequent classification layer and sigmoid layer compute classification scores. For training, all parameters are initialized with the trained model from , except for the dimensionality reduction layer and classification layer whose parameters are initialized randomly from a Gaussian distribution with mean and standard deviation .
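The descriptor head described above (max view-pooling, reduction to 512 dimensions, then per-material sigmoid scores) can be sketched with NumPy. This is only an illustration of the data flow: the reduction is assumed to be a plain linear layer, the random weights and the 0.01 initialization scale are placeholders, and nine views and five materials are used as example sizes.

```python
import numpy as np


def describe_point(view_features, W_reduce, b_reduce, W_cls, b_cls):
    """Map per-view features of one surface point to a descriptor and scores.

    view_features: (n_views, 1024) array of per-view CNN features.
    Returns a 512-D descriptor and one sigmoid score per material.
    """
    pooled = view_features.max(axis=0)           # max view-pooling -> (1024,)
    descriptor = W_reduce @ pooled + b_reduce    # assumed linear reduction -> (512,)
    logits = W_cls @ descriptor + b_cls          # classification layer -> (n_materials,)
    scores = 1.0 / (1.0 + np.exp(-logits))       # independent sigmoid per material
    return descriptor, scores


# Placeholder inputs and weights, for illustration only.
rng = np.random.default_rng(0)
views = rng.normal(size=(9, 1024))
W_reduce = rng.normal(scale=0.01, size=(512, 1024))
b_reduce = np.zeros(512)
W_cls = rng.normal(scale=0.01, size=(5, 512))
b_cls = np.zeros(5)
descriptor, scores = describe_point(views, W_reduce, b_reduce, W_cls, b_cls)
```

Because the pooling is an element-wise maximum, the descriptor does not depend on the order in which the views are supplied.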
Structured material predictions.
Figure 3 visualizes the per-point material label predictions for a characteristic input mesh. Note that self-occlusions and non-discriminative views can cause erroneous predictions. Further, symmetric parts (e.g., left and right chair legs) lack consistency in material predictions, since long-range dependencies between surface points are not explicitly considered in our network. Finally, network material predictions are limited only to surface points, and not throughout the whole shape.
To address these challenges, the last part of our architecture incorporates a structured probabilistic model, namely a Conditional Random Field (CRF). The CRF models both local and long-range dependencies in the material predictions across the input surface represented as a polygon mesh, and also projects the point-based predictions onto the input mesh. We treat the material predictions on the surface as binary random variables. There are such variables per input polygon, each indicating the presence/absence of a particular material. Note that this formulation accommodates multi-material predictions.
Our CRF incorporates: (a) unary factors that evaluate the probability of polygons to be labeled according to predicted point material labels, (b) pairwise factors that promote the same material label for adjacent polygons with low dihedral angle, (c) pairwise factors that promote the same material label for polygons whose geodesic distance is small, (d) pairwise factors that promote the same material label for polygons related under symmetry. Specifically, given all surface random variables for an input shape , the joint distribution is expressed as follows:
where is the binary variable indicating if face is labeled with material , and is a normalization constant. The unary factor sets the label probabilities of the surface point nearest to face according to the network output. The pairwise factors encode pairwise interactions between adjacent faces, following previous CRFs for mesh segmentation . Specifically, we define a factor favoring the same material label prediction for neighboring polygons with similar normals. Given the angle between their normals ( is divided by to map it between ), the factor is defined as follows:
where and represent the binary labels for adjacent faces , and are learned factor- and material-dependent weights. The factors favor similar labels for polygons , which are spatially close (according to geodesic distance ) and also belong to the same connected component:
where the weights and are learned factor- and material-dependent parameters, and represents the geodesic distance between and , normalized to .
Finally, our CRF incorporates symmetry-aware factors. We note that such symmetry-aware factors were not considered before in other CRF-based mesh segmentation approaches. Specifically, our factors favor similar labels for polygons , which are related under a symmetry. We detect rotational and reflective symmetries between components by matching surface patches through ICP, extracting their mapping transformations, and grouping them together when they undergo a similar transformation following Lun et al. [28, 29]. The symmetry-aware factors are expressed as:
where the weights and are learned factor- and label-dependent parameters, and expresses the Euclidean distance between face centers after applying the detected symmetry.
Exact inference in this probabilistic model is intractable. Thus we use mean-field inference to approximate the most likely joint assignment to all random variables (Algorithm 11.7 of ). Figure 3 shows material predictions over the input mesh after performing inference in the CRF with and without symmetry factors.
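To make the inference step concrete, the following sketch runs mean-field updates on a simplified binary CRF for a single material. It is not the model defined above: all pairwise factors (adjacency, geodesic, symmetry) are collapsed into one hypothetical symmetric weight matrix `W` that rewards equal labels, and the unary term is taken directly from a per-face probability.

```python
import numpy as np


def mean_field_binary(unary_prob, W, n_iters=20):
    """Approximate marginals q[f] = P(face f has the material).

    unary_prob: (n_faces,) probabilities from the unary term.
    W: (n_faces, n_faces) symmetric, zero-diagonal agreement weights; the
       assumed pairwise factor is exp(W[f, g]) when faces f and g agree.
    """
    eps = 1e-6
    p = np.clip(np.asarray(unary_prob, dtype=float), eps, 1 - eps)
    unary_logit = np.log(p) - np.log1p(-p)
    q = p.copy()
    for _ in range(n_iters):
        for f in range(len(q)):
            # Expected agreement reward for label 1 minus that for label 0.
            logit = unary_logit[f] + W[f] @ (2.0 * q - 1.0)
            q[f] = 1.0 / (1.0 + np.exp(-logit))
    return q


# Three faces in a chain: the uncertain middle face is pulled toward its neighbors.
W_chain = np.array([[0.0, 2.0, 0.0],
                    [2.0, 0.0, 2.0],
                    [0.0, 2.0, 0.0]])
q_chain = mean_field_binary([0.9, 0.4, 0.9], W_chain)
```

Since each material has its own binary variable per face, the same routine can be run once per material, which is what allows multi-material predictions.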
To train the network, we sample evenly-distributed surface points from each of our 3D training shapes. Points that lack a material label or are externally invisible are discarded. The remaining points are subsampled to per shape to fit memory constraints. The network is trained end-to-end with a multi-task loss function that includes a multi-class binary cross-entropy loss for material classification and a contrastive loss to align 3D points in descriptor space according to their underlying material (Figure 1 (right)). Specifically, given: (i) a set of training surface points from 3D shapes, (ii) a “positive” set consisting of surface point pairs labeled with the same material label, (iii) a “negative” set consisting of surface point pairs that do not share any material labels, (iv) binary indicator values per training point and label (equal to 1 when is labeled with label , otherwise), the network parameters are trained according to the following multi-task loss function:
The loss function is composed of the following terms:
where represents the probability of our network to assign the material to the surface point according to its descriptor , measures squared Euclidean distances between the normalized image and surface point descriptors, and is a margin typically used in contrastive loss (we set it to ). The loss terms have weights which were selected empirically to balance the terms to have the same order of magnitude during training. We will refer to the network optimized with this loss as “Multitask”. We also experiment with a variant that utilizes solely classification loss. In this case . We will refer to this network as “Classification”. Note that in both Multitask and Classification, the classification layer is trained with an effective loss weight of . For Multitask, the learning rate multiplier of the classification layer is increased to compensate for .
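As a rough illustration of how the two terms combine, the sketch below implements a multi-label binary cross-entropy and the standard margin-based contrastive loss on L2-normalized descriptors. The exact form of the contrastive term, the margin, and the weights `w_class` and `w_con` are assumptions of this example and are left as parameters.

```python
import numpy as np


def classification_loss(probs, targets):
    """Binary cross-entropy summed over materials, averaged over points."""
    eps = 1e-7
    p = np.clip(probs, eps, 1 - eps)
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return bce.sum(axis=1).mean()


def contrastive_loss(desc_a, desc_b, is_positive, margin):
    """Pull positive pairs together; push negative pairs beyond the margin."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    d2 = ((a - b) ** 2).sum(axis=1)               # squared Euclidean distance
    d = np.sqrt(d2)
    pos_term = d2                                  # positives: penalize distance
    neg_term = np.maximum(margin - d, 0.0) ** 2    # negatives: penalize closeness
    return np.where(is_positive, pos_term, neg_term).mean()


def multitask_loss(probs, targets, desc_a, desc_b, is_positive,
                   margin, w_class, w_con):
    """Weighted sum of the classification and contrastive terms."""
    return (w_class * classification_loss(probs, targets)
            + w_con * contrastive_loss(desc_a, desc_b, is_positive, margin))
```

Setting `w_con` to zero in this sketch leaves only the classification term, which mirrors the classification-only variant.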
Multitask training is performed with Adam with learning rate , . The network is trained in cycles of iterations. We choose a stopping point when losses converge on a validation set, which occurs by the end of the second cycle. In order to optimize for the contrastive loss, the network is trained in a Siamese fashion with two branches that share weights (see Fig. 1 right). Classification training is performed through stochastic gradient descent with momentum. The initial learning rate is set to and momentum is set to . The learning rate policy is polynomial decay with power . Weight decay is set to . We train Classification for two cycles of iterations. In the second cycle, the initial learning rate is reduced to and momentum is reduced to . Note that variants of both optimization procedures were tried for both loss functions and that we only report the optimal settings here. We also note that we tried contrastive-only loss but it did not perform as well as the variants here.
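For reference, a polynomial decay policy is commonly implemented as below, with the rate falling from its initial value to zero over one training cycle. This assumes the usual "poly" schedule; the base rate, cycle length, and power are parameters here because their values are not reproduced above.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Learning rate at `iteration` under polynomial decay over `max_iter` steps."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power
```

With `power` equal to 1 the decay is linear; smaller powers keep the rate high for longer before dropping near the end of the cycle.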
During training, point pairs are sampled from and with a 1:4 ratio, with the intuition that learning to separate different classes in descriptor space is more difficult than grouping the same class together. To balance material classes, we explicitly cycle through all material pair combinations when sampling pairs. For example, if we sample a negative wood-glass pair, subsequent negative pairs will not be wood-glass until all other combinations have been sampled. Because it is possible for points to have multiple ground truth labels (e.g. metal or plastic), we ensure that negative pairs do not share any ground truth labels. For example, if we try to sample a plastic-metal pair and we draw a metal or plastic point paired with a metal point, this pair would be discarded and re-sampled until a true negative plastic-metal pair is drawn. On a Pascal Titan X GPU, training with batch size takes about hours per iterations.
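The negative-pair sampling scheme can be sketched as follows. This is an illustrative reconstruction, not the training code: points are represented as hypothetical `(id, labels)` tuples, and the `max_tries` cap and the early exit when no valid pair exists are safeguards added for the example.

```python
import itertools
import random

MATERIALS = ["wood", "plastic", "metal", "glass", "fabric"]


def sample_negative_pairs(points, n_pairs, rng, max_tries=100):
    """Sample pairs of point ids whose ground-truth label sets are disjoint.

    points: list of (point_id, frozenset_of_labels); a "metal or plastic"
    point carries both labels. Material combinations are visited in turn so
    that no combination repeats before all others have been sampled.
    """
    by_label = {m: [p for p in points if m in p[1]] for m in MATERIALS}
    combos = [c for c in itertools.combinations(MATERIALS, 2)
              if by_label[c[0]] and by_label[c[1]]]
    pairs = []
    while len(pairs) < n_pairs:
        added = 0
        for m1, m2 in combos:
            if len(pairs) >= n_pairs:
                break
            for _ in range(max_tries):
                a = rng.choice(by_label[m1])
                b = rng.choice(by_label[m2])
                if a[1].isdisjoint(b[1]):  # reject pairs sharing any label
                    pairs.append((a[0], b[0]))
                    added += 1
                    break
        if added == 0:  # no true negative pair exists; avoid looping forever
            break
    return pairs


points = [
    (0, frozenset({"wood"})),
    (1, frozenset({"glass"})),
    (2, frozenset({"metal", "plastic"})),  # "metal or plastic" annotation
    (3, frozenset({"metal"})),
    (4, frozenset({"plastic"})),
    (5, frozenset({"fabric"})),
]
pairs = sample_negative_pairs(points, 20, random.Random(0))
```

Positive pairs can be drawn analogously by cycling over single materials and picking two points that share that label.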
The CRF module is trained to maximize the log-likelihood of the material labelings in our training meshes on average:
where are ground-truth binary material labels per polygon in the training shape from our training shape set . We use gradient descent and initialize the CRF weights to . Training takes 8 hours on a Xeon E5-2630 v4 processor.
We evaluate our approach in a series of experiments. For our test set, we sample K evenly-distributed surface points from each of benchmark test shapes. We discard externally invisible points, and evenly subsample the rest to points per shape. Our final test set consists of K points. See supplementary for distribution of points.
Mean precision for nearest neighbor retrievals is computed by averaging the number of neighbors that share a ground truth label with the query point over the number of retrieved points (). Nearest neighbors are retrieved in descriptor space from a class-balanced subset of training points. Mean precision by class is obtained by computing the mean precision over the subset of the test set containing only test points that belong to the class of interest. Table 1 summarizes the mean precision at varying values of . Both Classification and Multitask variations achieve similar mean class precision at all values of . Furthermore, note that the Multitask variation achieves better precision than the Classification variation over all values of in every class except for wood. We believe this is likely because the contrastive loss component of the multi-task loss encourages distance between clusters of dissimilar classes, while the classification-only loss encourages the clusters to be separable without necessarily being far apart. Therefore it is less likely for Multitask descriptors to have nearby neighbors from a different cluster.
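The retrieval metric can be illustrated with a brute-force NumPy sketch. Labels are encoded as boolean multi-hot rows so that points with several ground-truth labels are handled; the function name, the toy descriptors, and the use of exhaustive Euclidean search are assumptions of this example.

```python
import numpy as np


def mean_precision_at_k(query_desc, query_labels, db_desc, db_labels, k):
    """Fraction of the k nearest database points sharing a label with the query,
    averaged over all query points.

    *_desc: (n, d) float descriptors; *_labels: (n, n_materials) boolean arrays.
    """
    diff = query_desc[:, None, :] - db_desc[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)                       # (n_query, n_db)
    nn = np.argsort(d2, axis=1)[:, :k]                  # indices of k nearest
    shared = (query_labels[:, None, :] & db_labels[nn]).any(axis=-1)
    return shared.mean(axis=1).mean()


# Toy data with two materials (column 0: wood, column 1: glass).
db_desc = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
db_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=bool)
q_desc = np.array([[0.05, 0.0], [0.0, 0.01]])
q_labels = np.array([[1, 0], [0, 1]], dtype=bool)
precision = mean_precision_at_k(q_desc, q_labels, db_desc, db_labels, k=2)
```

In this toy case the first query retrieves two wood points and matches both, while the second (a glass point lying in the wood cluster) matches none, giving a mean precision of 0.5. Per-class precision follows by restricting the query rows to points of one class.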
To demonstrate that our descriptors are useful for material classification, we evaluate the learned classifier on our test shapes. We measure the top-1 accuracy per material label. The top-1 accuracy of a label prediction for a given point is 1 if the point has that label according to ground-truth, and 0 otherwise. If the point has multiple ground-truth labels, the accuracy is averaged over them. The top-1 accuracy for a material label is computed by averaging this measure over all surface points. These numbers are summarized in Table 2.
We note that both Classification and Multitask variations produce similar mean class top-1 accuracies. However, the Classification variation exhibits a larger variance in its top-1 class accuracies, with better wood accuracy in exchange for worse glass and fabric accuracies compared to the Multitask variation. After applying the CRF, both variations have improved top-1 accuracies for all classes except for glass. Glass prediction accuracy remains almost the same for the Multitask variation, but drops drastically for Classification. We suspect that this occurs because glass parts sometimes share similar geometry with wooden parts in furniture (for example, flat tabletops or flat cabinet doors may be made of either glass or wood). In this case, several point-wise predictions will compete for both glass and wood. If more of these predictions are wood rather than glass, it is likely the CRF will smooth out the predictions towards wood, which will result in a performance drop for glass predictions. Fig. 12 shows top-1 prediction confusion matrices. Wood points are often predicted correctly, yet sometimes are confused with metal. Glass points are often confused with wood points. Fabric is occasionally confused with plastic or wood. These confusions often happen for chairs with thin leather backs or thin seat cushions. Plastic is occasionally confused with metal. This is due to parts that are thin rounded cylinders often used in both metal and plastic-made components. Furthermore, the proportion of plastic labels to “metal or plastic” labels is low in our training dataset, which makes the learning less reliable in the case of plastic. In both variations, there is a bias towards wood predictions. This is likely due to the abundance of wooden parts in real-world furniture designs reflected in our datasets. However, the bias is less pronounced in the Multitask variation.
Thus we believe that the Multitask variation is better for a more balanced generalization performance across classes.
Effect of Number of Views.
To study the effect of the number of views, we train the MVCNN with 3 views (1 viewpoint, 3 distances) and compare to our results above with 9 views (3 viewpoints, 3 distances): see Table 3. Multiple viewpoints are advantageous.
5.2 Material-aware Applications
We illustrate the utility of the material-aware descriptors learned by our method in some prototype applications.
Given the material-aware segmentation produced by our method, we can automatically texture a 3D mesh based on the predicted material of its faces. Such a tool can be used to automate texturing for new shapes or for collectively texturing existing shape collections. If the mesh does not have UV coordinates, we generate them automatically by simultaneous multi-plane unwrapping. Then, we apply a texture to each mesh face according to the physical material predicted by the material-aware segmentation. We have designed representative textures for each of the physical materials predicted by our method (wood, plastic, metal, glass, fabric). Resulting renderings for a few of the meshes from our test set can be seen in Figure 5.
Retrieval of 3D parts.
Given a query 3D shape part from our test set, we can search for geometrically similar 3D parts in the training dataset. However, retrieval based on a geometric descriptor can return parts with inconsistent materials (see Figure 6(a)), whereas a designer might want to find geometrically similar parts with consistent materials (e.g. to replace the query part or its texture with a compatible database part). In Figure 6(b), we show retrieval results when we use a geometric descriptor along with a simple material compatibility check. Our pipeline is used to obtain the material label for the untextured query part. Then, we retrieve geometrically similar parts from our training set whose crowdsourced material label agrees with the predicted one. For our prototype, we used the multi-view CNN of Su et al. to compute geometric descriptors of parts.
Our material prediction pipeline allows us to perform simulation-based analysis of raw geometric shapes without any manual annotation of density, elasticity or other physical properties. This kind of visualization can be useful in interactive design applications to assist designers as they create models. In Figure 7, we show a prototype application which takes as input an unannotated polygon mesh, and simulates the effect of a downward force on it assuming the ground contact points are fixed. The material properties of the shape are predicted using a lookup table which maps material labels predicted by our method to representative density and elasticity values. We use the Vega toolkit to select the force application region and deform the mesh under a downward impulse of 4800 Ns evenly distributed over this area. For this prototype, we ignore fracture effects and internal cavities, and assume the material is perfectly elastic. An implicit Newmark integrator performs finite element analysis over a voxelized () version of the shape. The renderings in Figure 7 show both the local surface strain (area distortion) as well as the induced deformation of shapes with different material compositions.
We presented a supervised learning pipeline to compute material-aware local descriptors for untextured 3D shapes, and developed the first crowdsourced dataset of 3D shapes with per-part physical material labels. Our learning method employs a projective convolution network in a Siamese setup, and material predictions inferred from this pipeline are smoothed by a symmetry-aware conditional random field. Our dataset uses a carefully designed crowdsourcing strategy to gather reasonably reliable labels for thousands of shapes, and an expert labeling procedure to generate ground truth labels for a smaller benchmark set used for evaluation. We demonstrated prototype applications leveraging the learned descriptors, and are placing the dataset in the public domain to drive future research in material-aware geometry processing.
Our work is a first step and has several limitations. Our experiments have studied only a small set of materials, with tolerably discriminative geometric differences between their typical parts. Our projective architecture depends on rendered images and can hence process only visible parts of shapes. Also, our CRF-based smoothing is only a weak regularizer and cannot correct gross inaccuracies in the unary predictions. Addressing these limitations is a promising avenue for future work.
We believe that the joint analysis of physical materials and object geometry is an exciting and little-explored direction for shape analysis and design. Recent work on functional shape analysis  has been driven by priors based on physical simulation, mechanical compatibility or human interaction. Material-aware analysis presents a rich orthogonal direction that directly influences the function and fabricability of shapes. It would be interesting to combine annotations from 2D and 3D datasets to learn better material prediction models. It would also be interesting to reason about parametrized or fine-grained materials, such as different types of wood or metal, with varying physical properties. As a driving application, interactive modeling tools that provide continuous material-aware feedback on the shape being modeled could significantly aid real-world design tasks. Finally, there is significant scope for developing “material palettes” – probabilistic models of material co-use that take into account many intersectional design factors such as function, aesthetics, manufacturability and cost.
We acknowledge support from NSF (CHS-1617861, CHS-1422441, CHS-1617333). We thank Olga Vesselova for her input to the user study and Sourav Bose for help with FEM simulation of furniture.
7 Supplementary Material
This supplementary material is organized as follows. First, we show the data collection interface and discuss additional statistics that may be of interest (section 7.1). Second, we discuss some additional training details (section 7.2). Third, we show statistics for our test set (section 7.3). Fourth, we discuss in detail the 2D classification baseline that we used in our evaluation (section 7.4). Fifth, we visualize embedding plots via t-SNE for our learned descriptor space (section 7.5). Sixth, we show confusion matrices for both the Classification and Multitask networks (section 7.6), for the 3-view variants (section 7.7), and for a network trained with only contrastive loss (section 7.8). Seventh, we show a sample of our dataset as well as a visual sample of our material prediction results (section 7.9).
7.1 Data collection
Our data collection interface is shown in Fig. 8. Four different rendered views covering the front, sides and back of the textured 3D shape were shown. At the foot of the page, a single shape component was highlighted while the rest of the 3D shape appeared faded. Each query highlighted a different component. Workers were asked to select a label from a set of materials for the highlighted component. In total, we collected labeled components in shapes. On average of the surface area per mesh was labeled. For training, we kept only shapes with of components labeled ( shapes).
7.2 Training Points
To train the network, we sample evenly-distributed surface points from each of our 3D training shapes. Points that lack a material label or are not visible from outside the shape are discarded. Point visibility is determined via ray-mesh intersection tests. The remaining points are subsampled to roughly 75 per shape, again so that the selected points are approximately uniformly distributed over the shape surface. The number of points per shape is limited by memory: we keep the dataset in main memory to avoid slow I/O during training, and sampling roughly 75 points per shape already requires 60 GB of memory. The views corresponding to these points are preprocessed and saved as single-channel, unsigned integer arrays, which are read directly into memory at training time. Storing the preprocessed views in memory rather than reading them from disk yielded a speedup in training time.
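The text does not specify the exact sampling algorithm, so the following is only one plausible sketch of the two sampling stages: area-weighted sampling of points on a triangle mesh, followed by greedy farthest-point subsampling to spread the retained points over the surface. Both technique choices and the function names `sample_surface` and `farthest_point_subsample` are assumptions. The label and visibility filters (ray-mesh intersection tests) are omitted.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edge vectors."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface(vertices, faces, n, rng):
    """Draw n points uniformly over the mesh surface (area-weighted faces)."""
    areas = [triangle_area(*(vertices[i] for i in f)) for f in faces]
    chosen = rng.choices(range(len(faces)), weights=areas, k=n)
    points = []
    for fi in chosen:
        a, b, c = (vertices[i] for i in faces[fi])
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)
        # Barycentric weights that are uniform over the triangle.
        w = (1.0 - s, s * (1.0 - r2), s * r2)
        points.append(tuple(w[0] * a[k] + w[1] * b[k] + w[2] * c[k]
                            for k in range(3)))
    return points


def farthest_point_subsample(points, k, rng):
    """Greedily keep k points, each as far as possible from those already kept."""
    if k >= len(points):
        return list(points)
    start = rng.randrange(len(points))
    selected = [start]
    # Distance from every point to its nearest selected point.
    dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        selected.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return [points[i] for i in selected]
```

In this sketch, a shape would first be densely sampled with `sample_surface`, filtered, and then reduced with `farthest_point_subsample(points, 75, rng)`. The greedy subsampling costs O(nk) distance evaluations, which is acceptable for a few thousand candidate points per shape.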
7.3 Benchmark Test Set Distributions
In Fig. 9, we show the distribution of labels across components in the benchmark shape dataset, as well as the distribution of labels across the points sampled from these shapes that form our evaluation test set. Notice that although there are a large number of metal and plastic components, relatively few metal or plastic points are sampled. This is because many metal or plastic components are small, thin structures (e.g. handles, table legs). Recall that we sample our test points uniformly across the surface of our shapes, so the number of points sampled from a component is proportional to its surface area.
7.4 2D Classification Network Baseline
To evaluate the baseline of using a 2D material classification network, we use MINC. The network is based on GoogLeNet. We take their pretrained network and finetune it on their dataset. The classification layer is finetuned to classify only the five materials we consider in this paper. Furthermore, we choose to finetune with greyscale images. The reason for this is that our texture-less 3D renderings do not offer any color cues; therefore, we train the 2D network under similar conditions. The network is trained with stochastic gradient descent with momentum until the validation loss converges. The learning rate follows a polynomial decay policy, and weight decay is applied. We call this finetuned network MINC-bw. The confusion matrix for the network on our test 3D renderings is in Fig. 10. The poor performance suggests that it is non-trivial to use 2D photos to train a network that learns material descriptors for 3D shapes.
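For reference, a polynomial decay learning rate policy is commonly defined as the base rate scaled by (1 - t/T) raised to a power, where t is the current iteration and T the total number of iterations. The sketch below assumes this common definition. The function name `poly_learning_rate` is hypothetical, and the numeric values in the usage note are illustrative, not the hyperparameters used for MINC-bw.

```python
def poly_learning_rate(base_lr, iteration, max_iterations, power):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    if max_iterations <= 0:
        raise ValueError("max_iterations must be positive")
    # Clamp progress to [0, 1] so the rate never becomes negative or complex.
    progress = min(max(iteration / max_iterations, 0.0), 1.0)
    return base_lr * (1.0 - progress) ** power
```

With illustrative values `base_lr=0.01`, `max_iterations=100` and `power=0.5`, the rate starts at 0.01, falls to 0.005 at iteration 75, and reaches zero at the final iteration.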
7.5 Embedding Visualization
We visualize the learned material-aware descriptor embedding with t-SNE in Fig. 11. In both the Classification and Multitask variations, we see a tendency of our network to cluster datapoints.
7.6 Classification vs Multitask Confusion Matrices
We show the confusion matrices for Classification (as well as Multitask, for reference) in Fig. 12. Note that Classification predictions are more biased towards wood. As a result, its glass performance drops after CRF since many glass points tend to lie on surfaces that resemble wood surfaces (e.g. flat table tops, flat cabinet doors) – if many local predictions are wood rather than glass, it is likely that the CRF will smooth the predictions to wood.
7.7 Confusion Matrices for 3 view MVCNN
The confusion matrices for 3 view MVCNNs (1 viewpoint, 3 distances) are in Fig. 13. For reference, the matrices for the 9 view MVCNNs (3 viewpoints, 3 distances) are also shown. Note that confusions are reduced when using 9 views over 3 views. For both Classification and Multitask, fabric performance is relatively unaffected by reduced views while plastic suffers. In Classification 3 views, wood predictions dominate. In Multitask 3 views, metal predictions dominate – as a consequence, glass does relatively well (since glass is typically competing with wood) and plastic does extremely poorly (since plastic parts can often be shaped like metal and our training dataset contains a high number of “plastic or metal” labels relative to “plastic” labels).
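For readers reproducing such matrices, the following minimal sketch builds a confusion matrix from per-point labels and derives the mean per-class accuracy from it. The function names `confusion_matrix` and `mean_class_accuracy` are hypothetical and the example labels are made up. Rows are assumed to index ground-truth classes and columns predicted classes.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix with rows = ground-truth class, columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def mean_class_accuracy(matrix):
    """Average of per-class recall, ignoring classes with no test samples."""
    per_class = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    if not per_class:
        raise ValueError("the confusion matrix contains no samples")
    return sum(per_class) / len(per_class)
```

Averaging per class rather than per point keeps frequent materials from dominating the score. With three wood points (two correct) and one glass point (misclassified as wood), overall accuracy is 0.5, while mean class accuracy is (2/3 + 0)/2 = 1/3.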
7.8 Confusion Matrix for Contrastive Loss Only
The MVCNN trained with only contrastive loss achieves a mean class top 1 accuracy of 59% (in comparison to 65% with Classification and 66% with Multitask). This variant often confuses plastic for metal, and performs poorly on glass relative to the Classification or Multitask variants. The confusion matrix is shown in Fig. 14.
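As a reminder of what the contrastive objective measures, here is a minimal sketch of the standard pairwise contrastive loss in the style of Hadsell et al. The exact formulation and margin used for this variant are not restated in this section, so the convention below (squared distance for same-material pairs, squared hinge on the margin for different-material pairs, no 1/2 factor) and the default margin are assumptions. The name `contrastive_loss` is hypothetical.

```python
import math


def contrastive_loss(desc_a, desc_b, same_material, margin=1.0):
    """Pairwise contrastive loss on two descriptor vectors.

    Same-material pairs are pulled together (loss = d^2). Different-material
    pairs are pushed apart until their distance exceeds the margin.
    """
    d = math.dist(desc_a, desc_b)
    if same_material:
        return d * d
    return max(0.0, margin - d) ** 2
```

A loss of this form shapes only the geometry of the descriptor space and does not directly optimize class decisions.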
7.9 Sample of Dataset and Predictions
Here we show some samples from both our high-quality expert-annotated benchmark dataset as well as our large crowdsourced training dataset. Please refer to the legend by each shape for labels. The colors are consistent within each figure but may not be across figures.
Figure 15 shows a small sample of our benchmark test shapes with ground truth labels. Figure 16 shows per-point predictions for each of the 1024 test point samples on these benchmark test shapes. Figure 17 shows per-part predictions after the 1024 point predictions are smoothed with our symmetry-aware CRF. Figure 18 shows a small sample of our MTurk crowdsourced data with the 75 training point samples per shape shown.
References
- 3D Warehouse, Trimble Inc. https://3dwarehouse.sketchup.com. Accessed: 2017.
- Herman Miller, Inc. https://www.hermanmiller.com. Accessed: 2017.
- IKEA. http://www.ikea.com. Accessed: 2017.
- J. Barbič, F. S. Sin, and D. Schroeder. Vega FEM Library. http://www.jernejbarbic.com/vega, 2012.
- S. Bell, P. Upchurch, N. Snavely, and K. Bala. Opensurfaces: A richly annotated catalog of surface appearance. ACM Trans. Graph., 32(4):111:1–111:17, 2013.
- S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. In Proc. CVPR, 2015.
- D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. Computer Graphics Forum, 34(5):13–23, 2015.
- M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
- D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3D model retrieval. Computer Graphics Forum, 22(3):223–232, 2003.
- K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese. Text2Shape: Generating shapes from natural language by learning joint embeddings. arXiv preprint arXiv:1803.08495, 2018.
- K. Chen, K. Xu, Y. Yu, T.-Y. Wang, and S.-M. Hu. Magic decorator: Automatic material suggestion for indoor digital scenes. ACM Trans. Graph., 34(6):232:1–232:11, 2015.
- M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proc. CVPR, pages 3606–3613, 2014.
- R. Girdhar, D. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In Proc. ECCV, 2016.
- T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In Proc. CVPR, 2018.
- R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, pages 1735–1742. IEEE, 2006.
- R. Hu, M. Savva, and O. van Kaick. Functionality representations and applications for shape analysis. Comp. Graph. For. (Eurographics State-of-The-Art Report), 2018.
- H. Huang, E. Kalogerakis, S. Chaudhuri, D. Ceylan, V. G. Kim, and E. Yumer. Learning local shape descriptors with view-based convolutional neural networks. ACM Trans. Graph., 37:6:1–6:14, 2018.
- A. Jain, T. Thormählen, T. Ritschel, and H.-P. Seidel. Material Memex: Automatic material suggestions for 3D objects. ACM Trans. Graph., 31(6):143:1–143:8, 2012.
- E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3D shape segmentation with projective convolutional networks. In Proc. CVPR, 2017.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
- J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML, 2001.
- Y. Li, R. Bu, M. Sun, and B. Chen. PointCNN. arXiv preprint arXiv:1801.07791, 2018.
- O. Litany, T. Remez, E. Rodola, A. M. Bronstein, and M. M. Bronstein. Deep functional maps: Structured prediction for dense shape correspondence. In Proc. ICCV, volume 2, page 8, 2017.
- C. Liu, L. Sharan, E. H. Adelson, and R. Rosenholtz. Exploring features in a bayesian framework for material recognition. In Proc. CVPR, pages 239–246, 2010.
- Z. Lun, E. Kalogerakis, and A. Sheffer. Elements of style: Learning perceptual shape style similarity. ACM Trans. Graph., 34(4), 2015.
- Z. Lun, E. Kalogerakis, R. Wang, and A. Sheffer. Functionality preserving shape style transfer. ACM Trans. Graph., 35(6), 2016.
- H. Maron, M. Galun, N. Aigerman, M. Trope, N. Dym, E. Yumer, V. G. Kim, and Y. Lipman. Convolutional neural networks on surfaces via seamless toric covers. ACM Trans. Graph., 36(4), 2017.
- J. Masci, D. Boscaini, M. M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proc. ICCV workshops, pages 37–45, 2015.
- N. J. Mitra, M. Wand, H. Zhang, D. Cohen-Or, and M. Bokeloh. Structure-aware shape processing. Eurographics State of the Art Reports, 2013.
- S. Muralikrishnan, V. G. Kim, and S. Chaudhuri. Tags2Parts: Discovering semantic regions from shape tags. In Proc. CVPR, 2018.
- C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. CVPR, 2017.
- C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proc. NIPS, 2017.
- G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3D representations at high resolution. In Proc. CVPR, 2017.
- M. Savva, A. X. Chang, and P. Hanrahan. Semantically-enriched 3D models for common-sense knowledge. CVPR Workshop on Functionality, Physics, Intentionality and Causality, 2015.
- L. Sharan, C. Liu, R. Rosenholtz, and E. Adelson. Recognizing materials using perceptually inspired features. IJCV, 103:348–371, 2013.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In Proc. CVPR, 2018.
- H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proc. ICCV, 2015.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In Proc. CVPR, 2015.
- P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph., 36(4), 2017.
- T. Y. Wang, H. Su, Q. Huang, J. Huang, L. Guibas, and N. J. Mitra. Unsupervised texture transfer from images to model collections. ACM Trans. Graph., 35(6):177:1–177:13, 2016.
- Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic Graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
- T. Weyrich, J. Lawrence, H. P. A. Lensch, S. Rusinkiewicz, and T. Zickler. Principles of appearance acquisition and representation. Found. Trends. Comput. Graph. Vis., 4(2), 2009.
- Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proc. CVPR, 2015.
- Y.-L. Yang, J. Wang, and N. J. Mitra. Reforming shapes for material-aware fabrication. In Computer Graphics Forum, volume 34, pages 53–64. Wiley Online Library, 2015.
- L. Yi, H. Su, X. Guo, and L. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In Proc. CVPR, 2017.
- A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In Proc. CVPR, 2017.