Beam Search for Learning a Deep Convolutional Neural Network of 3D Shapes
Abstract
This paper addresses 3D shape recognition. Recent work typically represents a 3D shape as a set of binary variables corresponding to 3D voxels of a uniform 3D grid centered on the shape, and resorts to deep convolutional neural networks (CNNs) for modeling these binary variables. Robust learning of such CNNs is currently limited by the small datasets of 3D shapes available – an order of magnitude smaller than other common datasets in computer vision. Related work typically deals with the small training datasets using a number of ad hoc, hand-tuning strategies. To address this issue, we formulate CNN learning as a beam search aimed at identifying an optimal CNN architecture – namely, the number of layers, nodes, and their connectivity in the network – as well as estimating parameters of such an optimal CNN. Each state of the beam search corresponds to a candidate CNN. Two types of actions are defined to add new convolutional filters or new convolutional layers to a parent CNN, and thus transition to children states. The utility function of each action is efficiently computed by transferring parameter values of the parent CNN to its children, thereby enabling an efficient beam search. Our experimental evaluation on the 3D ModelNet dataset demonstrates that our model pursuit using the beam search yields a CNN that outperforms the state of the art on 3D shape classification.
I Introduction
This paper addresses the problem of 3D shape classification. Our goal is to predict the object class of a given 3D shape, represented as a set of binary presence-absence indicators associated with 3D voxels of a uniform 3D grid centered on the shape. This is one of the basic problems in computer vision, as 3D shapes are important visual cues for image understanding. This is a challenging problem, because an object’s shape may significantly vary due to changes in the object’s pose and articulation, and appear quite similar to shapes of other objects.
There is a host of literature on reasoning about 3D object shapes [1, 2]. Traditional approaches typically extract feature points from a 3D shape [3, 4, 5], and find correspondences between these feature points for shape recognition and retrieval [6, 7, 8]. However, these methods tend to be sensitive to long-range non-rigid and non-isometric shape deformations, because, in part, the feature points capture only local shape properties. In addition, finding optimal feature correspondences is often formulated as an NP-hard non-convex Quadratic Assignment Problem (QAP), whose efficient solutions come at the price of compromised accuracy.
Recently, the state-of-the-art performance in 3D shape classification has been achieved using deep 3D Convolutional Neural Networks (CNNs), called 3D ShapeNets [9]. 3D ShapeNets extends the well-known deep architecture called AlexNet [10], which is widely used in image classification. Specifically, the extension modifies 2D convolutions computed in AlexNet to 3D convolutions. However, 3D ShapeNets’ architecture consists of 3 convolutional layers and 2 fully-connected layers, with a total of 12M parameters. This in turn means that robust learning of the 3D ShapeNets parameters requires large training datasets. But in comparison with common datasets used in other domains, currently available 3D shape datasets are smaller by at least an order of magnitude. For example, the benchmark SHREC’14 dataset [11] has only 8K shapes, and the ModelNet dataset [9] has 150K shapes, whereas the well-known ImageNet [12] has 1.5M images.
Faced with small training datasets, existing deep-learning approaches to 3D shape classification typically resort to a number of ad hoc, hand-tuning strategies for robust learning. These strategies are rarely justified by extensive empirical evaluation, which would take a prohibitively long time, and instead rely on past findings of other related work (e.g., image classification). Hence, their particular design choices about the architecture and learning – e.g., the number of convolutional and fully-connected layers used, or specification of the learning rate in backpropagation – may be suboptimal.
Motivated by the state-of-the-art performance of 3D ShapeNets [9], we here adopt this framework, and focus on addressing the aforementioned issues in a principled manner. Specifically, we formulate a model pursuit for robust learning of a 3D CNN. This learning is specified as a beam search aimed at identifying an optimal CNN architecture – namely, the number of layers, nodes, and their connectivity in the network – as well as estimating parameters of such an optimal CNN. Each state of the beam search corresponds to a candidate CNN. Two types of actions are defined to add either new convolutional filters or a new convolutional layer to a parent CNN, and thus transition to children states. The utility function of each action is efficiently estimated as the training accuracy of the resulting CNN. The efficiency is achieved by transferring parameter values of the parent CNN to its children. Starting from the root “shallow and narrow” CNN, our beam search is guided by the utility function toward generating more complex CNN candidates with increasingly higher classification accuracy on 3D shape training data, until training accuracy stops increasing. The CNN candidate with the highest training accuracy is finally taken as our 3D shape model, and used in testing.
In our experimental evaluation on the 3D ModelNet dataset [9], our beam search yields a 3D CNN with far fewer parameters than 3D ShapeNets [9] (about 80K versus 12M). The results demonstrate that our 3D CNN outperforms 3D ShapeNets by 3% on 40 shape classes. This suggests that a model pursuit using beam search is a viable alternative to the current heuristic practice in designing deep CNNs.
In the following, Sec. II gives an overview of our 3D shape classification using 3D CNN; Sec. III formulates our beam search in terms of the state space, successor function, heuristic function, and lookahead and backtracking strategy; Sec. IV specifies our efficient transfer of parameters from a parent model to its children in the beam search; and Sec. V presents our results.
II 3D Shape Classification Using 3D CNN
For 3D shape classification we use a 3D CNN. Given a binary volumetric representation of a 3D shape as input, our CNN predicts an object class of the shape. Below, we first describe the shape representation, and then explain the architecture of 3D CNN.
In this paper, we adopt the binary volumetric representation of 3D shapes presented in [9]. Specifically, each shape is represented as a set of binary indicators corresponding to 3D voxels of a uniform 3D grid centered on the shape. The indicators take value 1 if the corresponding 3D voxels are occupied by the 3D shape, and 0 otherwise. Hence, each 3D shape is represented by a binary three-dimensional tensor. The grid has a fixed size, and the shape size is normalized such that a smaller cube of voxels fully contains the shape; the remaining empty voxels serve as padding in all directions around the shape. Each shape is also labeled with a corresponding object class.
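The representation above can be illustrated with a minimal sketch that maps a set of 3D points to a padded binary occupancy grid. The function name `voxelize`, the point-based input, and the default sizes (`grid=30`, `inner=24`) are assumptions chosen for illustration; they are not values or code taken from the text.

```python
import numpy as np

def voxelize(points, grid=30, inner=24):
    """Map an (N, 3) array of points to a binary occupancy grid.

    `grid` (grid side) and `inner` (side of the cube that must contain the
    shape) are assumed values; substitute those of a given setup.
    """
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    extent = (points.max(axis=0) - lo).max()
    extent = extent if extent > 0 else 1.0
    # Scale uniformly so that the longest side spans the inner cube.
    scaled = (points - lo) / extent * (inner - 1)
    # Center the shape inside the inner cube along the shorter axes.
    offset = ((inner - 1) - scaled.max(axis=0)) / 2.0
    pad = (grid - inner) // 2
    idx = np.rint(scaled + offset).astype(int) + pad
    vox = np.zeros((grid, grid, grid), dtype=np.uint8)
    vox[idx[:, 0], idx[:, 1], idx[:, 2]] = 1  # 1 = occupied, 0 = empty
    return vox

rng = np.random.default_rng(0)
pts = rng.random((500, 3)) * np.array([1.0, 0.5, 0.25])
vox = voxelize(pts)
print(vox.shape, int(vox.sum()))
```

With these defaults, three empty voxels remain on every side of the inner cube and act as the padding described above.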
As mentioned in Sec. I, the architecture of our 3D CNN is similar to that of 3D ShapeNets [9], with the important distinction that we greatly reduce the total number of parameters. As we will discuss in greater detail in the results section, the beam search that we use for the model pursuit yields a 3D CNN with 3 convolutional layers and 1 fully-connected layer, totaling 80K parameters. The top layer of our model is the standard softmax layer for classifying the input shapes into one of the possible object classes.
In the following section, we specify our model pursuit and the initial root CNN from which the beam search starts exploring candidate, more complex models, until training error cannot be further reduced.
III Beam Search
Search-based approaches have a long track record of successfully solving computer vision problems, including structured prediction for scene labeling [13, 14], object localization [15], and boundary detection [16]. Unlike the above related work, search in this paper is not used for inference, but for identifying an optimal CNN architecture and estimating CNN parameters. For efficiency of learning, we consider a beam search which limits the exploration of the state space to a few top candidates. Our beam search is defined by the following:

States correspond to CNN candidates,

Initial state represents a small CNN,

Successor function generates new states based on actions taken in parent states,

Heuristic function evaluates the utility of the actions, and thus guides the beam search,

Lookahead and backtracking strategy.
State space: The state space is defined as a set of states, $S = \{s\}$, where each state $s$ represents a network configuration (also called architecture). A CNN’s network configuration specifies the number of convolutional and fully-connected layers, the number of hidden units or 3D convolutional filters used in each layer, and which layers have max-pooling. In this paper, we constrain the beam search such that the size of the fully-connected layer remains the same as in the initial CNN, because we have empirically found that extending only the convolutional layers maximally increases the network’s classification accuracy (as also reported in [17, 18]).
Initial State: Our model pursuit starts from a relatively simple initial CNN, illustrated in Figure 1, and the goal of the beam search is to extend the initial model by adding either new convolutional filters to existing layers, or new convolutional layers. The initial model consists of only two convolutional layers and one fully-connected layer. The first convolutional layer has 16 filters of size 6 and stride 2. The second convolutional layer has 32 filters of size 5 and stride 2. Finally, the fully-connected layer has 400 hidden units.
The parameters of the initial CNN are trained as follows. We first generatively pre-train the model in a layer-wise fashion, and then use a discriminative fine-tuning procedure. The standard Contrastive Divergence [19] is used to pre-train the two convolutional layers, whereas the top fully-connected layer is trained using Fast Persistent Contrastive Divergence [20]. Once one layer is learned, its weights are fixed and its hidden activations are fed into the next layer as input. After this pre-training, we discriminatively fine-tune the pre-trained model. We first replace the topmost layer with a new randomly initialized fully-connected layer, and then add a standard softmax layer on top of the network to output class probabilities. The standard cross-entropy loss is computed using ground-truth class labels, and used in backpropagation to update the weights in all layers.
Given this simple, initial CNN, the beam search gradually builds a search tree with new states. Exploration of the state space consists of generating successor states from a few selected parent states. The selection is based on ranking the parent states by a heuristic function, as further explained below.
Successor function: The successor function generates new states $s'$ from $s$ by applying an action $a \in A$ from a set of possible actions $A$. In this paper, we specify $A$ as consisting of two types of actions: 1) Add a new convolutional layer on top of all existing convolutional layers, where the newly added layer has the same number of filters, filter size, and stride as the top convolutional layer; 2) Double the number of filters in the top convolutional layer. Alternative definitions of $A$ are also possible. In particular, we have also considered an action which adds max-pooling to convolutional layers; however, such an extended $A$ has not produced better performance on test data, relative to the above case when only two types of actions are considered.
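As an illustration of the two action types, the following minimal sketch represents a state as an immutable network configuration and generates its two successors. The class and function names (`ConvLayer`, `State`, `add_layer`, `double_filters`, `successors`) are hypothetical and introduced only for this example; parameter transfer and training are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class ConvLayer:
    filters: int
    size: int
    stride: int

@dataclass(frozen=True)
class State:
    conv: Tuple[ConvLayer, ...]
    fc_units: int  # kept fixed during the search

def add_layer(s: State) -> State:
    """Action 1: stack a copy of the top convolutional layer's configuration."""
    top = s.conv[-1]
    return State(s.conv + (ConvLayer(top.filters, top.size, top.stride),), s.fc_units)

def double_filters(s: State) -> State:
    """Action 2: double the number of filters in the top convolutional layer."""
    top = s.conv[-1]
    wider = ConvLayer(2 * top.filters, top.size, top.stride)
    return State(s.conv[:-1] + (wider,), s.fc_units)

ACTIONS = (add_layer, double_filters)

def successors(s: State) -> List[State]:
    """Apply every action in the action set to the parent state."""
    return [action(s) for action in ACTIONS]

# Initial state as described in the text: 16 and 32 filters, 400 hidden units.
initial = State((ConvLayer(16, 6, 2), ConvLayer(32, 5, 2)), 400)
children = successors(initial)
```

In this sketch, applying `double_filters` and then `add_layer` to `initial` gives convolutional layers with 16, 64, and 64 filters, which are the filter counts reported for the best found model in Sec. V.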
As one of our technical contributions, we specify an efficient successor function for enabling an efficient beam search. Specifically, we apply a knowledge transfer procedure, following the approach of [21], which efficiently copies parameter values of the previous state $s$ to the newly added parameters in the generated state $s'$. After this knowledge transfer, the new CNN is fine-tuned using only a few iterations (in our experiments, the number of iterations is 10), for robustness. Note that a significantly larger number of iterations would have been necessary for this fine-tuning, had we randomly initialized the newly added parameters of $s'$ (as is common in practice) instead of using knowledge transfer. In this way, we achieve efficiency. In the following section, we explain our knowledge transfer procedure in more detail.
Heuristic function: The heuristic function $H(s, s')$ ranks new states $s'$ given their parent states $s$. $H$ is used to guide the beam search, which selects the top $K$ successor states, where $K$ is taken as the beam width. $H(s, s')$ is defined as the difference in classification accuracy on training data between $s$ and $s'$.
Lookahead and backtracking strategy: For robustness, we specify a lookahead and backtracking strategy for selecting the top $K$ successor states. We first explore the state space by applying the successor function several times from parent states $s$, until the resulting search tree reaches a depth limit, $D$. Then, among the leaf states at the tree depth $D$, we select the top $K$ leaves evaluated with $H$. From these top $K$ leaf states, we backtrack to the unique children of parent states $s$, which are taken as valid new candidate CNNs.
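The search loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function `beam_search`, its arguments, and the toy problem (integer states with a synthetic accuracy) are hypothetical, and the heuristic of a leaf is read here as its accuracy gain over the beam state it descends from. In practice, `accuracy` would involve knowledge transfer and fine-tuning, and its values would be cached.

```python
def beam_search(root, successors, accuracy, beam_width=2, depth_limit=2, max_steps=20):
    """Beam search with lookahead and backtracking (illustrative sketch)."""
    beam = [root]
    best, best_acc = root, accuracy(root)
    for step in range(max_steps):
        scored = []  # (heuristic value, first-level child) for every leaf
        for parent in beam:
            parent_acc = accuracy(parent)
            # Lookahead: each entry is (leaf, first-level child it descends from).
            frontier = [(child, child) for child in successors(parent)]
            for depth in range(depth_limit - 1):
                frontier = [(nxt, child)
                            for leaf, child in frontier
                            for nxt in successors(leaf)]
            scored += [(accuracy(leaf) - parent_acc, child) for leaf, child in frontier]
        # Rank leaves by the heuristic and keep the top ones.
        scored.sort(key=lambda item: item[0], reverse=True)
        # Backtrack: keep the unique first-level children of the top leaves.
        new_beam = []
        for gain, child in scored[:beam_width]:
            if child not in new_beam:
                new_beam.append(child)
        if not new_beam:
            break
        candidate = max(new_beam, key=accuracy)
        # Stop when training accuracy no longer increases.
        if accuracy(candidate) <= best_acc:
            break
        best, best_acc = candidate, accuracy(candidate)
        beam = new_beam
    return best, best_acc

# Toy problem: states are integers, two actions per state, accuracy peaks at 12.
toy_successors = lambda n: [n + 1, 2 * n]
toy_accuracy = lambda n: -abs(n - 12)
best, best_acc = beam_search(1, toy_successors, toy_accuracy)
print(best, best_acc)
```

Because only first-level children enter the beam, a child can be selected for a promising descendant even when its own immediate gain is small, which is the purpose of the lookahead.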
IV Knowledge Transfer
When generating new candidate CNNs, we make our beam search efficient by appropriately transferring parameter values from parent CNNs to their descendants. In the sequel, we specify this knowledge transfer for the two types of search actions considered in this paper.
IV-A Net2WiderNet
A new state can be generated by doubling the number of filters in the top convolutional layer of a parent CNN. This action effectively renders the new candidate CNN “wider” than its parent model, and hence we call this action Net2WiderNet. We estimate the parameters of the “wider” CNN as follows.
The key idea is to estimate the newly added parameters such that the parent CNN and its “wider” child CNN give the same outputs for the same inputs. This knowledge-transfer strategy ensures that the newly generated model is not worse than the previously considered model. After this knowledge transfer, parameters of the “wider” child CNN are fine-tuned to verify whether the action resulted in a better model than the parent CNN, as evaluated with the heuristic function.
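In order to widen a convolutional layer, $i$, we need to update both sets of model parameters $W^{(i)}$ and $W^{(i+1)}$ at layers $i$ and $i+1$, respectively, where layer $i$ has $m$ inputs and $n$ outputs, and layer $i+1$ has $p$ outputs. When the action Net2WiderNet extends layer $i$ so it has $q > n$ outputs, we define the random mapping function $g: \{1, \dots, q\} \to \{1, \dots, n\}$ as
$$g(j) = \begin{cases} j, & j \le n, \\ \text{random sample from } \{1, \dots, n\}, & j > n. \end{cases} \qquad (1)$$
Then, the new sets of parameters $U^{(i)}$ and $U^{(i+1)}$ can be computed from $W^{(i)}$ and $W^{(i+1)}$ as
$$U^{(i)}_{k,j} = W^{(i)}_{k,g(j)}, \qquad (2)$$
$$U^{(i+1)}_{j,h} = \frac{1}{|\{x : g(x) = g(j)\}|}\, W^{(i+1)}_{g(j),h}, \qquad (3)$$
where $k = 1, \dots, m$, $j = 1, \dots, q$, and $h = 1, \dots, p$.
From (2), the first $n$ columns of $W^{(i)}$ are simply copied directly into $U^{(i)}$. Columns $n+1$ through $q$ of $U^{(i)}$ are created by randomly choosing columns of $W^{(i)}$, as defined in $g$. The random selection is performed with replacement, so each column of $W^{(i)}$ may be copied many times to columns $n+1$ through $q$ of $U^{(i)}$.
From (3), we similarly have that the first $n$ rows of $W^{(i+1)}$ are copied into $U^{(i+1)}$, and rows $n+1$ through $q$ of $U^{(i+1)}$ are created by randomly choosing rows of $W^{(i+1)}$, as defined in $g$. In addition, the new parameters in $U^{(i+1)}$ are normalized so as to account for the random replication of rows in $U^{(i+1)}$. The normalization is computed by dividing the new parameters by a replication factor, given by $|\{x : g(x) = g(j)\}|$.
It is straightforward to prove that the resulting extended network with new parameters $U^{(i)}$ and $U^{(i+1)}$ produces the same outputs as the original network with parameters $W^{(i)}$ and $W^{(i+1)}$, for the same inputs.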
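The function-preserving property can be checked numerically with a minimal NumPy sketch. For brevity it assumes two fully-connected layers without biases rather than 3D convolutions, and a ReLU nonlinearity; the function name `net2wider` and all array names are introduced here for illustration only.

```python
import numpy as np

def net2wider(W1, W2, q, rng):
    """Widen a layer from n to q outputs.

    W1 has shape (m, n) and W2 has shape (n, p); returns U1 (m, q), U2 (q, p).
    """
    n = W1.shape[1]
    assert q > n, "the widened layer must have more outputs than the original"
    # Mapping g: identity on the first n units, random choice with replacement after.
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=q - n)])
    # Replication factor: number of units in the widened layer that copy each original unit.
    counts = np.bincount(g, minlength=n)
    U1 = W1[:, g]                        # copy columns, as in (2)
    U2 = W2[g, :] / counts[g][:, None]   # copy rows and normalize, as in (3)
    return U1, U2

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(4, 3))
x = rng.normal(size=(7, 5))
U1, U2 = net2wider(W1, W2, 8, rng)
same = np.allclose(relu(x @ W1) @ W2, relu(x @ U1) @ U2)
print(same)
```

The check holds for any elementwise nonlinearity, because each replicated unit produces the same activation as the unit it copies, and the outgoing weights of all copies sum to the original outgoing weight.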
An example of this procedure is illustrated by Figure 2. In this example, we increase the size of one hidden layer by adding one additional unit, while keeping the activations propagated to the next hidden layer unchanged. Assume that we randomly pick one hidden unit to replicate; we then copy its incoming weights to the new unit. The outgoing weight of the replicated unit must be copied so that it also goes out of the new unit. This outgoing weight must also be divided by 2 to compensate for the replication.
IV-B Net2DeeperNet
The second type of action that we consider is termed Net2DeeperNet, since it adds a new convolutional layer to a parent CNN, thereby producing a deeper child CNN. Specifically, Net2DeeperNet replaces a layer $h^{(i)} = \phi(W^{(i)\top} h^{(i-1)})$ with two layers $h^{(i)} = \phi\big(U^{(i)\top} \phi(W^{(i)\top} h^{(i-1)})\big)$, where $\phi$ denotes the activation function. The new parameter matrix $U^{(i)}$ is specified as the identity matrix.
Figure 3 shows an illustration of Net2DeeperNet. When we apply this action, we add a new convolutional layer and simply set the new convolution filters to be identity functions. Zero padding is also added to keep the size of the activations unchanged.
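A minimal NumPy sketch of such an identity initialization is given below. It assumes an odd filter size (so that the filter has a center) and a stride of 1 in the inserted layer, so that zero padding alone keeps the activation size unchanged; the function names `identity_filters` and `conv3d_same` are hypothetical, and the naive convolution is meant for checking the idea rather than for speed.

```python
import numpy as np

def identity_filters(channels, size):
    """Filters of shape (out, in, size, size, size) that copy each input channel."""
    assert size % 2 == 1, "a centered identity filter needs an odd size"
    w = np.zeros((channels, channels, size, size, size))
    c = size // 2
    for ch in range(channels):
        w[ch, ch, c, c, c] = 1.0  # single 1 at the center of the matching channel
    return w

def conv3d_same(x, w):
    """Naive stride-1 3D correlation with zero padding; x has shape (in, D, H, W)."""
    out_ch, in_ch, k = w.shape[0], w.shape[1], w.shape[2]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))  # zero padding on spatial axes
    depth, height, width = x.shape[1:]
    y = np.zeros((out_ch, depth, height, width))
    for i in range(k):
        for j in range(k):
            for l in range(k):
                patch = xp[:, i:i + depth, j:j + height, l:l + width]
                y += np.einsum('oi,idhw->odhw', w[:, :, i, j, l], patch)
    return y

rng = np.random.default_rng(0)
x = rng.random((3, 4, 4, 4))
w = identity_filters(channels=3, size=5)
unchanged = np.allclose(conv3d_same(x, w), x)
print(unchanged)
```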
It is worth noting that Net2DeeperNet does not guarantee that the resulting deeper network will give the same outputs as the original one, for the same inputs, when the activation function used is the sigmoid. The guarantee holds when the activation function used is the rectified linear unit (ReLU), though. However, in our experiments, we have not found that using the sigmoid hurts the specified knowledge transfer of Net2DeeperNet toward the efficient beam search.
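The difference between the two activation functions can be seen in a small numerical check. The sketch below assumes fully-connected layers without biases and random weights, all introduced for illustration. With the identity as the inserted layer, the deeper network applies the activation twice; the ReLU is idempotent, so the output is preserved, whereas applying the sigmoid twice changes the output.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))
x = rng.normal(size=(6, 5))
U = np.eye(4)  # identity initialization of the inserted layer

preserved = {}
for name, phi in (("relu", relu), ("sigmoid", sigmoid)):
    shallow = phi(x @ W)           # original layer
    deep = phi(phi(x @ W) @ U)     # original layer followed by the inserted layer
    preserved[name] = bool(np.allclose(shallow, deep))
print(preserved)
```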
V Experimental Results
V-A Dataset
For evaluation, we use the ModelNet dataset [9], and the same experimental setup as 3D ShapeNets [9], for fair comparison. ModelNet consists of 40 object classes such as chairs, tables, toilets, sofas, etc. Each class has 100 unique CAD models, representing the most common 3D shapes of the class, totaling 151,128 voxelized 3D models in the entire dataset. We conduct 3D classification on both the 10-class subset ModelNet10, and the full 40-class dataset, as in [9]. We use the provided voxelizations and train/test splits for evaluation. Specifically, for each class, 960 instances are used for training and 240 instances are used for testing.
We have implemented our beam search in MATLAB, on top of a GPU-accelerated software library of 3D ShapeNets [9]. Experiments are run on a machine with the NVIDIA Tesla K80 GPU accelerator.
V-B 3D shape classification accuracy
Our classification accuracy is averaged over all classes on test data, and used for comparison with 3D ShapeNets [9]. In addition, we average our classification accuracy over five runs of the beam search from five different initial CNNs, all of which have the same architecture, but differently initialized parameters.
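One way to compute such an averaged accuracy is sketched below. Reading "averaged over all classes" as the mean of the per-class accuracies is an assumption on our part, and the function name `mean_class_accuracy` and the toy labels are hypothetical.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so every class has equal weight."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Toy labels: class 0 is classified correctly 2 out of 3 times, class 1 always.
acc = mean_class_accuracy([0, 0, 0, 1], [0, 0, 1, 1])

# Averaging over several runs (here, made-up per-run values) is then a plain mean.
run_accuracies = [acc, acc]
final_accuracy = float(np.mean(run_accuracies))
```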
We test how our performance varies for different depth limits and beam widths. The training and testing accuracies, as well as the total beam-search runtime, are presented in Figures 4, 5, 6, 7, 8, 9.
In the experiments, we observe that when we additionally consider a third type of action, which adds a new max-pooling layer, this action is never selected by the beam search. This is in part due to the fact that adding a pooling layer requires reinitializing the subsequent fully-connected layer, which in turn reduces the effectiveness of the already learned parameters. For this reason, we do not include the action of adding a pooling layer in our specification of the beam search.
We compare our approach with two other approaches, 3D ShapeNets [9] and DeepPano [22], in Table I. As can be seen, our accuracy is 3.63 percentage points higher than that of DeepPano in the 40-class experiment, and 2.55 percentage points higher in the 10-class experiment.
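Table I: Classification accuracy on ModelNet40 and ModelNet10.

| Algorithm | ModelNet40 Classification | ModelNet10 Classification |
| --- | --- | --- |
| Ours | 81.26% | 88.00% |
| DeepPano [22] | 77.63% | 85.45% |
| 3DShapeNets [9] | 77% | 83.5% |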
We observe that our model produced by the beam search has far fewer parameters than the network used in [9]. Their model consists of three convolutional and two fully-connected learned layers. Their first layer has 48 filters of size 6; the second layer has 160 filters of size 5; the third layer has 512 filters of size 4; the fourth layer is a fully-connected RBM with 1200 hidden units; and the fifth and final layer is a fully-connected layer whose size equals the number of classes. Our best found model consists of three convolutional layers and one fully-connected layer. Our first layer has 16 filters of size 6 and stride 2; the second layer has 64 filters of size 5 and stride 2; the third layer has 64 filters of size 5 and stride 2; the last fully-connected layer has 400 hidden units, as in the initial CNN. Our model has about 80K parameters, i.e., about 0.6% of the 12M parameters of theirs.
The recent literature also presents two other works on 3D shape classification, VoxNet [23] and MVCNN [24], which obtain higher classification accuracies on ModelNet40 (83% and 90.1%, respectively) than ours. However, a direct comparison with these approaches is not suitable. VoxNet uses a training process that takes around 12 hours, while the individual training time for our best found model is less than 5 hours. MVCNN is based on 2D views of a 3D shape taken from different angles, so it is inherently a 2D CNN approach rather than a 3D CNN. In addition, MVCNN uses the large collection of 2D images from ImageNet, containing millions of images that belong to the same set of classes as the object categories in ModelNet40, to help its training process, whereas our only training dataset is ModelNet40. For these reasons, we believe that our experimental results are not directly comparable to theirs.
VI Conclusion
We have presented a new deep model pursuit approach for 3D shape classification. Our learning uses a beam search, which explores the search space of various candidate CNN architectures toward achieving maximal classification accuracy. The search tree is efficiently built using a heuristic function based on training classification accuracy, as well as knowledge transfer to efficiently estimate parameters of new candidate models. Our experiments demonstrate that our approach outperforms the state of the art on the popular ModelNet10 and ModelNet40 3D shape datasets by 3%. Our approach also successfully reduces the total number of parameters by 99.4%. Our approach could be easily applied to other problems requiring robust deep learning on small training datasets.
Acknowledgment
This research has been supported in part by the National Science Foundation under grants IIS-1302700 and IOS-1340112.
References
 [1] P. J. Besl and N. D. McKay, “A method for registration of 3D shapes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, Feb. 1992.
 [2] J. W. Tangelder and R. C. Veltkamp, “A survey of content based 3d shape retrieval methods,” Multimedia Tools Appl., vol. 39, no. 3, Sep. 2008. [Online]. Available: http://dx.doi.org/10.1007/s11042-007-0181-0
 [3] M. Aubry, U. Schlickewei, and D. Cremers, “The wave kernel signature: A quantum mechanical approach to shape analysis,” in CVS, Nov 2011.
 [4] J. Sun, M. Ovsjanikov, and L. Guibas, “A concise and provably informative multi-scale signature based on heat diffusion,” in SGP’09, 2009.
 [5] T. Gatzke, C. Grimm, M. Garland, and S. Zelinka, “Curvature maps for local shape comparison,” in SMA, June 2005.
 [6] E. Rodola, S. Rota Bulo, T. Windheuser, M. Vestner, and D. Cremers, “Dense non-rigid shape correspondence using random forests,” in CVPR, June 2014.
 [7] M. Leordeanu and M. Hebert, “A spectral technique for correspondence problems using pairwise constraints,” in ICCV, vol. 2, Oct 2005.
 [8] F. Zhou and F. De la Torre, “Factorized graph matching,” in CVPR, June 2012.
 [9] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in CVPR, 2015.
 [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
 [11] B. Li et al., “A comparison of 3d shape retrieval methods based on a large-scale benchmark supporting multimodal queries,” Computer Vision and Image Understanding, vol. 131, pp. 1–27, 2015.
 [12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, 2015.
 [13] M. Lam, J. Rao Doppa, S. Todorovic, and T. G. Dietterich, “HC-Search for structured prediction in computer vision,” in CVPR, June 2015.
 [14] A. Roy and S. Todorovic, “Scene labeling using beam search under mutex constraints,” in CVPR, June 2014.
 [15] C. H. Lampert, M. B. Blaschko, and T. Hofmann, “Efficient subwindow search: A branch and bound framework for object localization,” PAMI, vol. 31, no. 12, 2009.
 [16] N. Payet and S. Todorovic, “Sledge: Sequential labeling of image edges for boundary detection,” IJCV, vol. 104, no. 1, 2013.
 [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv, 2015.
 [18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, 2014.
 [19] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, 2002.
 [20] T. Tieleman and G. Hinton, “Using fast weights to improve persistent contrastive divergence,” in ICML, 2009.
 [21] T. Chen, I. Goodfellow, and J. Shlens, “Net2net: Accelerating learning via knowledge transfer,” arXiv, 2015.
 [22] B. Shi, S. Bai, Z. Zhou, and X. Bai, “Deeppano: Deep panoramic representation for 3d shape recognition,” SPL, vol. 22, no. 12, 2015.
 [23] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in IROS. IEEE, 2015.
 [24] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in ICCV, 2015.