Deep Convolutional Decision Jungle
for Image Classification
Abstract
We propose a novel method called the deep convolutional decision jungle (CDJ) and its learning algorithm for image classification. The CDJ maintains the structure of standard convolutional neural networks (CNNs), i.e., multiple layers of multiple response maps fully connected. Each response map, or node, in both the convolutional and fully-connected layers selectively responds to class labels, such that each data sample travels along a specific soft route of activated nodes. The proposed CDJ learns features automatically, whereas decision forests and jungles require predefined feature sets. Compared to CNNs, the method embeds the benefits of using data-dependent discriminative functions, which better handle multi-modal/heterogeneous data; furthermore, the method yields more diverse, sparse network responses, which in turn can be used for cost-effective learning and classification. The network is learnt by combining the conventional softmax loss with a proposed entropy loss in each layer. The entropy loss, as used in decision-tree growing, measures the purity of data activation according to the class-label distribution. The back-propagation rule for the proposed loss function is derived within the stochastic gradient descent (SGD) optimization of CNNs. We show that our proposed method outperforms state-of-the-art methods on three public image classification benchmarks and one face verification dataset. We also demonstrate the use of auxiliary data labels, when available, which helps our method learn more discriminative routing and representations and leads to improved classification.
1 Introduction
Random forests (RFs) have been widely used as ensemble classifiers for image classification. Whereas an RF is a feature selection method, convolutional neural networks (CNNs) have proven powerful for feature learning. Recently, several works [2, 9, 23, 7, 17, 1, 6, 24, 14] have attempted to combine the two worlds (e.g., see Fig. 1a), incorporating hierarchical tree structures [2, 9], multi-modal data [23], modular networks built by clustering and CNNs [24, 14], and/or accelerating speed [7]. Also relevant to this study are works that encourage sparsity in representations for regularization and for memory/time efficiency [17, 1, 6]. These works report improved accuracy; however, there is room to improve, especially in the following aspects:

In the adopted binary tree structures [2, 9, 23, 7], once a data sample takes a wrong path it cannot recover, which leads to overfitting. Soft partitioning [9] relieves the issue to a certain degree; however, the structure grows exponentially through recursive binary splits [23, 7]. The decision jungle [18] type of algorithm is a natural extension and has shown good generalization ability.

Most existing methods for combining trees and CNNs [23, 7] require additional model parameters. The split process is often performed by another deep routing network, and it happens recursively down the tree structure, exploding the number of parameters. It is not straightforward to apply these methods to existing large/deep networks [11, 20, 21].
In this work, we propose a novel method that applies the concept of class entropy or purity in RFs to existing CNN structures (Fig. 1b). The proposed method helps learn more discriminative and robust features from early layers. The architecture offers the following benefits:

Class-wise purity in early layers: Our method learns a convolutional neural network (CNN) with an entropy loss per layer. The proposed loss helps purify the response maps, which we will call ‘nodes’, in all intermediate layers of the CNN. The response maps in each layer are pushed on and off conditioned on the input, such that each response map is dedicated to certain classes (ideally a single class) rather than to all of them. Note that the response maps take continuous rather than discretized values for each data point, so this can be regarded as a form of ‘soft’ routing.

Decision jungle structure: Unlike in binary decision trees, in the fully connected decision jungle structure a sample’s path is recoverable in later layers. The decision jungle has been shown to be more robust than binary trees owing to its improved generalization ability. Such a structure also makes the method more flexible (applicable to any existing CNN without altering its architecture) and more memory-efficient than binary tree structures [18].

Minimum number of additional parameters: The proposed architecture learns routing directly through the activation/deactivation of the response maps, which keeps the number of model parameters as low as in the original CNN models. Existing methods [24,7] require routing networks in addition to the CNN, and the number of routing networks increases exponentially in a binary tree. The proposed method adds only a few parameters, e.g., the balancing parameter (see Sec. 3.1).

Encoding of auxiliary information: Intermediate layers can be further purified by auxiliary labels in addition to the class labels used for the softmax loss. Experiments show that this leads to significant accuracy improvements.
2 Related work
Table 1: Comparison of methods combining CNNs and tree structures.

Method | Tree structure (split) | Routing: Conv | Routing: FC | Routing w/o extra parameters | Big architecture | Additional information
[2]    | Binary                 |               | ✓           |                              |                  |
[9]    | Binary                 |               | ✓           |                              | ✓                |
[23]   | Binary                 | ✓             |             |                              |                  |
[7]    | Soft multi             | ✓             |             |                              | ✓                |
Ours   | Soft multi             | ✓             | ✓           | ✓                            | ✓                | ✓
Combination of CNNs and tree structures.
The objective of combining trees and CNNs in [2, 9, 23, 7] is to jointly learn feature representations of the input data and a tree-structured classifier conditioned on the input data. Previous methods can be categorized along several attributes, as in Table 1. The tree structures in [2, 9] are embedded in the fully connected (FC) layers [9] or multi-layer perceptrons [2], rather than in the convolutional (Conv) layers, and provide the final prediction as a classifier. In contrast, the tree structures in [23, 7] are embedded in the convolutional layers. They learn hidden modalities of the data, e.g., face poses in [23] and super-classes in [7], from the early layers. The work in [7] uses continuous rather than discrete weights, so their method can represent multiple ‘soft’ routes rather than the ‘hard’ binary splits of [2, 9, 23]. However, that work has demonstrated routing only for 2–3 splits. A major difficulty in applying [7] to a fully connected network like a decision jungle [18], where a node routes to all nodes in the next layer, is the need for additional routing-network parameters, which increase as the number of routes increases. Furthermore, the works in [23, 2] use small network architectures, while the methods in [9, 7] were applied to recently proposed CNN architectures (e.g., AlexNet [11], VGG-16 [20], and GoogLeNet [21]). Finally, the concept of trees or conditional activation has also improved CNN efficiency [1, 6].
Compared to previous works [2, 9, 23, 7], our method does not require additional routing parameters, while it enables multiple soft routes and can be applied to existing big architectures (cf. [7]). Experiments on three image classification benchmarks and a face verification benchmark demonstrate improved accuracy (Sec. 4). When additional label information is available, the method further improves accuracy via more explicit network routing driven by those labels.
Combination of CNNs and clustering loss.
Though they do not explicitly use tree structures, a related group of works learns CNNs while performing hierarchical clustering [24, 14, 17]: the method in [24] iteratively alternates feature representation learning and data clustering to better cluster data samples; the method in [14] introduces a clustering loss to help the classification task; and the concept of a mixture of experts is introduced into an LSTM architecture in [17] to improve model capacity for language modeling. Set-based loss functions [22, 16] have also been proposed to optimize intra-class/inter-class data variation in addition to the softmax function. Compared to these existing methods, our approach captures data separation from the early layers using supervised discrete class labels rather than exploiting sample distances in an unsupervised way.
3 Deep convolutional decision jungle architecture
The proposed deep convolutional decision jungle (CDJ) adopts the traditional convolutional neural network (CNN) architecture. Given a fixed network topology (e.g., the number and sizes of layers), the operation of a CNN is prescribed by a set of weight vectors (or convolution filters) $W$; the weight vectors of the fully-connected layers are regarded as convolution filters of size 1. The filter $w^l_{ij}$ at the $l$-th layer is convolved with the $i$-th output (or response) map $x^{l-1}_i$ of the $(l-1)$-th layer to generate the $l$-th layer's $j$-th response map $x^l_j$:

x^l_j = f\Big( \sum_{i=1}^{N_{l-1}} w^l_{ij} * x^{l-1}_i \Big),   (1)

where $N_{l-1}$ is the number of response maps in the $(l-1)$-th layer, $*$ denotes the convolution operation, and $f$ represents the non-linear activation and max-pooling operations. We use rectified linear units (ReLUs), but other differentiable activation functions can also be used.
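To make the layer-as-nodes view concrete, the sketch below (a minimal PyTorch-style illustration, not the authors' implementation; all names and shapes are assumptions) computes one layer of response maps following Eq. 1 together with the per-map scalar responses used by the entropy energy of Sec. 3.1.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CDJLayer(nn.Module):
    """One convolutional layer viewed as a set of decision-jungle nodes (Eq. 1)."""
    def __init__(self, in_maps, out_maps, ksize=3):
        super().__init__()
        # filter bank w^l: connects every map i of layer l-1 to every map j of layer l
        self.conv = nn.Conv2d(in_maps, out_maps, ksize, padding=ksize // 2)

    def forward(self, x_prev):
        a = F.relu(self.conv(x_prev))        # activated response maps x^l_j (the "nodes")
        node_scores = a.mean(dim=(2, 3))     # scalar response per node: ReLU + average pooling
        x = F.max_pool2d(a, 2)               # max-pooled maps passed to the next layer
        return x, node_scores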
The CDJ is instantiated by interpreting the response maps and convolution filters, respectively, as decision nodes and edges joining pairs of nodes in a decision jungle (Fig. 1b). In conventional decision trees and jungles, an input data point at a node is exclusively routed to a single child node. In our CDJ, however, a soft decision is made at each node: if the filter $w^l_{ij}$ produces a non-zero response map $x^l_j$, the data point is interpreted as being softly routed from node $x^{l-1}_i$ to node $x^l_j$.
3.1 Training deep convolutional decision jungles
Suppose we are given a set of training data points $\{(x_n, y_n)\}_{n=1}^{N}$ with $x_n \in \mathbb{R}^{D}$, $D$ being the dimensionality of the input space. For a $C$-class classification problem, $y_n \in \{1, \dots, C\}$. Training a CDJ corresponds to identifying the optimal set of convolutional filter parameters $W$, which we achieve by minimizing an energy functional that combines the standard training error with a new entropy energy:

E(W) = \mathcal{L}(W) + \lambda \sum_{l=1}^{L} E^l(W),   (2)

where $\lambda$ controls the contributions of the training error and entropy energy terms and $L$ is the number of layers in the CDJ. We use the softmax loss for the training cost functional $\mathcal{L}$:

\mathcal{L}(W) = -\sum_{n=1}^{N} \log p(y_n \mid x_n; W).   (3)
With the goal of embedding a decision jungle into a CNN architecture, we design our entropy energy to measure the quality of the routing performed on the dataset. One way of constructing such a regularizer is to adapt the class-entropy-based training criteria used in decision forests and jungles [18] into a differentiable cost functional.
Our entropy measure is defined based on the empirical class distributions at layer $l$:

P^l(j, c) = \frac{m^l(j, c)}{\sum_{j'=1}^{N_l} \sum_{c'=1}^{C} m^l(j', c')},   (4)

where

m^l(j, c) = \sum_{n=1}^{N} \bar{x}^l_j(x_n)\, \delta(y_n, c),   (5)

with $\delta(y_n, c) = 1$ if $y_n = c$ and $\delta(y_n, c) = 0$ otherwise. The scalar response $\bar{x}^l_j(x_n)$ is obtained by applying the non-linear activation (ReLU) followed by average pooling to $x^l_j$. The joint distribution $P^l$ is a two-dimensional array storing the relative frequencies of activation patterns at each pair of response and class indices. This definition is consistent with our interpretation of CNN response maps as nodes of a decision jungle. The entropy of the $l$-th layer is then defined as

E^l(W) = -\sum_{j=1}^{N_l} \sum_{c=1}^{C} P^l(j, c) \log P^l(j, c).   (6)
This expression is differentiable with respect to $W$ and facilitates gradient-descent-type optimization, e.g., stochastic gradient descent (SGD).
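As an illustration of Eqs. 4–6 under the reconstruction above, the sketch below computes the joint distribution and the per-layer entropy from the scalar node responses in a differentiable way; node_scores, labels, and layer_entropy are assumed names, not the authors' code.

import torch
import torch.nn.functional as F

def layer_entropy(node_scores, labels, num_classes, eps=1e-12):
    """Per-layer entropy energy E^l computed from (batch, N_l) node responses."""
    onehot = F.one_hot(labels, num_classes).float()   # delta(y_n, c)
    m = node_scores.t() @ onehot                      # unnormalized joint counts m^l(j, c) (Eq. 5)
    p = m / (m.sum() + eps)                           # empirical joint distribution P^l(j, c) (Eq. 4)
    return -(p * (p + eps).log()).sum()               # entropy E^l (Eq. 6), differentiable w.r.t. W

# Total objective of Eq. 2 (lam plays the role of lambda):
#   loss = softmax_loss + lam * sum(layer_entropy(s, y, C) for s in per_layer_scores)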
Stratified entropy loss.
The entropy energy is highly non-linear with multiple global (as well as local) minima, posing a significant challenge in the optimization of $E$. In our preliminary experiments, we observed that the new entropy energy (with $\lambda > 0$ in Eq. 2) almost always degrades the performance, as it tends to generate degenerate probability distributions $P^l$: the optimized solutions disable some classes and nodes by allocating (near-)zero probabilities to the entries associated with those classes and nodes. This reduces the overall entropy, but it leads to poor generalization. Figure 2 illustrates this problem with an example.
Our strategy is to explicitly control the optimization trajectory of $E$ by setting up intermediate goal states. To facilitate the exposition of our optimization framework, we first define, for each layer, a two-dimensional matrix $M^l$ that represents the unnormalized joint class distribution (see Eq. 5):

[M^l]_{jc} = m^l(j, c), \quad j = 1, \dots, N_l, \; c = 1, \dots, C.   (7)
This variable encapsulates the behavior of the filters $W$ as exercised in the class probability distribution. The proposed CDJ then iteratively identifies the solution $W$ (equivalently, $\{M^l\}$) by solving a new (sub-)optimization problem per iteration. Our guided optimization problem minimizes

E'(W) = \mathcal{L}(W) + \lambda \sum_{l=1}^{L} \| M^l - G^l \|_F^2,   (8)

where $\|\cdot\|_F$ is the Frobenius norm and $G^l$ is the guide variable for layer $l$.
At each iteration, we first determine the guide variables $G^l$ and then optimize the (unnormalized) probability map $M^l$ by penalizing its deviation from $G^l$, in addition to the softmax loss. The guide variable is designed to explicitly avoid the degenerate cases while reducing the entropy (more precisely, it should be regarded as the inverse of class purity, since $M^l$ is not a probability distribution) relative to the previous map, as detailed shortly. Given the guide variables, the optimization of the convolution filters is performed by standard SGD. The derivative of the penalty term in Eq. 8 is written as

\frac{\partial \| M^l - G^l \|_F^2}{\partial W} = \sum_{j=1}^{N_l} \sum_{c=1}^{C} 2\, \big( [M^l]_{jc} - [G^l]_{jc} \big)\, \frac{\partial [M^l]_{jc}}{\partial W},   (9)

with (see Eq. 7)

\frac{\partial [M^l]_{jc}}{\partial W} = \sum_{n=1}^{N} \delta(y_n, c)\, \frac{\partial \bar{x}^l_j(x_n)}{\partial W};   (10)

the remaining derivatives $\partial \bar{x}^l_j(x_n) / \partial W$ and $\partial \mathcal{L} / \partial W$ are standard terms appearing in CNNs and can be calculated straightforwardly.
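In practice, the chain-rule terms of Eqs. 9–10 need not be implemented by hand; the sketch below (an illustrative fragment under the same assumed helpers as above, not the authors' code) forms the guided objective of Eq. 8 with the guides held fixed, so that automatic differentiation reproduces the gradients.

import torch

def guided_loss(softmax_loss, m_per_layer, g_per_layer, lam):
    """Eq. 8: softmax loss plus Frobenius penalties ||M^l - G^l||_F^2 per layer.

    m_per_layer: list of (N_l, C) tensors M^l that depend on the filters W.
    g_per_layer: matching guide tensors G^l, treated as constants (detached).
    """
    penalty = sum(((m - g.detach()) ** 2).sum() for m, g in zip(m_per_layer, g_per_layer))
    return softmax_loss + lam * penalty   # minimized over W by standard SGD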
Construction of the guide variable $G^l$.
Our guide variable corresponds to an entropy-reduced version of $M^l$:

[G^l]_{jc} = (1 - \gamma)\, [M^l]_{jc} + [S^l]_{jc}\, [D^l]_{jc},   (11)

where the binary matrix $D^l$ and the matrix $S^l$ determine the target update directions and magnitudes, respectively. These two variables encode a set of constraints that help avoid the degeneracy of $M^l$ (and, equivalently, of $P^l$) while reducing the overall entropy. First, we enforce that only one entry of $D^l$ takes value 1 per node (note that $N_l \ge C$):

\sum_{c=1}^{C} [D^l]_{jc} = 1, \quad [D^l]_{jc} \in \{0, 1\} \quad \text{for all } j.   (12)

This, combined with the selection of entries discussed shortly, ensures that the resulting entropy of $G^l$ is smaller than that of the original map $M^l$. Second, we enforce that the per-class sums of $D^l$ are roughly balanced, explicitly preventing the allocation of zero probabilities to any class (Fig. 2b):

\sum_{j=1}^{N_l} [D^l]_{jc} \approx \frac{N_l}{C} \quad \text{for all } c.   (13)

The third condition retains the total mass of each node of $G^l$, preventing any filters from converging towards zero (Fig. 2c):

\sum_{c=1}^{C} [G^l]_{jc} = \sum_{c=1}^{C} [M^l]_{jc} \quad \text{for all } j.   (14)

Since $D^l$ is a binary variable, Eq. 14 and Eq. 11 uniquely determine $S^l$:

[S^l]_{jc} = \gamma \sum_{c'=1}^{C} [M^l]_{jc'},   (15)

where $\gamma$ takes the role of the balancing parameter $\lambda$ (cf. Eq. 2) and is fixed at a constant value throughout the entire experiments. The effect of different $\gamma$ values is examined in Fig. 3(c).
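The following sketch shows one possible way, under the reconstruction above, to build such a guide: each node keeps its total mass but shifts a fraction gamma of it onto a single target class, and targets are assigned under a per-class quota so that the classes stay roughly balanced. It is an illustrative construction satisfying Eqs. 12–14, not necessarily the authors' exact procedure; build_guide and all names are assumptions.

import torch
import torch.nn.functional as F

def build_guide(m, gamma):
    """Entropy-reduced guide G^l from the (N_l, C) count matrix M^l (Eqs. 11-15)."""
    n_nodes, n_classes = m.shape
    cap = (n_nodes + n_classes - 1) // n_classes     # per-class quota for rough balance (Eq. 13)
    quota = [0] * n_classes
    target = torch.zeros(n_nodes, dtype=torch.long)
    # assign the most confident nodes first, each to its most preferred class with quota left
    for j in m.max(dim=1).values.argsort(descending=True).tolist():
        prefs = m[j].argsort(descending=True).tolist()
        c = next((c for c in prefs if quota[c] < cap), prefs[0])
        quota[c] += 1
        target[j] = c                                # D^l: one target class per node (Eq. 12)
    d = F.one_hot(target, n_classes).float()
    row_mass = m.sum(dim=1, keepdim=True)            # per-node mass, preserved by construction (Eq. 14)
    return (1.0 - gamma) * m + gamma * row_mass * d  # Eq. 11 with the magnitudes of Eq. 15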
Training with auxiliary information.
Our new entropy loss can be defined for any labels that replace the class labels $y_n$. This enables a systematic way of exploiting auxiliary information when available. We demonstrate the effectiveness of this approach in our face verification experiment using auxiliary pose labels (see Table 4).
4 Experiments
Setup.
We evaluate our convolutional decision jungle (CDJ) on three standard image classification datasets, Oxford-IIIT Pet [15], CIFAR-100 [10], and Caltech-101 [3], and one face verification dataset, Multi-PIE [5]. The Oxford-IIIT Pet dataset consists of 7,349 images covering 37 different breeds of cats and dogs (e.g., Shiba Inu and Yorkshire Terrier). We adopt the training/test split of 3,680 and 3,669 images, respectively [15, 16]. The CIFAR-100 dataset contains 60,000 natural images of 100 classes that are categorized into 20 super-classes. We use the 50,000 training and 10,000 testing image split [25, 4, 13, 14]. The Caltech-101 dataset contains 9,146 images of 101 object categories; 30 images are chosen from each category for training and the remaining images (at most 50 per category) are used for testing [8].
The Multi-PIE (Session 1 subset) face verification dataset consists of photographs of 250 individuals taken under 20 different illumination conditions and 15 poses ranging from -90° to +90°. We proceed as in [23]: for training, we use all images (15 poses, 20 illumination conditions) of the first 150 individuals, while for testing we use one frontal view with neutral illumination (the ID07 entries) as the gallery image for each of the remaining 100 subjects. The rest of the images are used as probes. We use the responses of the penultimate layer of the trained networks as features for cosine-distance-based matching.
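A minimal sketch of this matching step, assuming gallery_feats and probe_feats hold the penultimate-layer features (names and shapes are illustrative, not from the paper):

import torch
import torch.nn.functional as F

def match_probes(gallery_feats, probe_feats):
    """For each probe, return the index of the gallery identity with the highest cosine similarity."""
    g = F.normalize(gallery_feats, dim=1)   # one row per gallery identity
    p = F.normalize(probe_feats, dim=1)     # one row per probe image
    return (p @ g.t()).argmax(dim=1)        # cosine similarities, then best match per probe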
For each dataset, we initialize our algorithm with a pre-trained standard CNN that achieves state-of-the-art performance: AlexNet [11] for Oxford-IIIT Pet, VGG-16 [20] for Caltech-101, and NiN [13] for CIFAR-100. For Multi-PIE, previous state-of-the-art results were achieved by cCNN forests [23], which embed a decision tree structure into a small CNN (Sec. 2). Empirically, the large-scale AlexNet achieves higher accuracy than [23], and so we adopt AlexNet as our baseline.
The training parameters, including the network topology, mini-batch sizes, number of training epochs, and dropout and local response normalization settings, are adopted from the individual networks: the mini-batch sizes are 256, 32, and 100 for AlexNet, VGG-16, and NiN, respectively. The number of epochs is 60 for AlexNet and VGG-16, and 100 for NiN. The learning rate is decayed from its initial value at every epoch for AlexNet and VGG-16. Our learning rate for the NiN architecture is overall slightly smaller than in [13]: it is kept fixed until the 80th epoch and then reduced after the 80th and 90th epochs; this small learning rate ensures that the energy functional decreases steadily. For our new entropy energy, we use larger mini-batch sizes proportional to the number of classes $C$ to ensure that the class distributions are balanced within a batch.
The class balancing constraints (Eqs. 12 and 13) cause our entropy energy to apply only to the layers whose sizes $N_l$ are greater than or equal to the number of classes $C$: we apply the entropy energy from the second convolutional layer onwards for AlexNet and VGG-16, and from the cccp3 layer onwards for NiN. The earlier layers, which do not use the entropy loss, can be regarded as feature extractors for the decision jungle.
Results.
Table 5 compares the results of our method to eleven state-of-the-art methods across the three image classification problems. Overall, our algorithm significantly improves upon all baselines, which demonstrates the effectiveness of our new entropy energy. Our method consistently outperforms state-of-the-art methods including DSN [12] and DDN [14]. In particular, our algorithm improves upon existing algorithms that augment the standard softmax loss similarly to ours, including the set-based magnet loss [16] (which further adopts data augmentation to improve performance) and the clustering loss [14]. Our network derives its routing from supervised class information rather than from unsupervised sample distances, as in these previous works, which helps it learn more discriminative features.
Figure 3 shows the effect of varying the hyper-parameters: (a) and (b) show the testing accuracy as a function of the training epoch and of the hyper-parameter value, respectively; (c) shows the average entropy of each layer for CNNs and CDJs. For both networks, the entropy becomes lower towards the last layers; minimizing the standard softmax loss in a CNN also tends to improve class purity around the last layer. However, by enforcing it explicitly, our algorithm achieves consistently lower entropy. The strong correlation between the testing accuracy (a) and the average per-layer entropy (c) demonstrates that our new entropy term contributes to improving generalization performance.
Table 4 shows the results of different algorithms on the Multi-PIE face verification dataset. Previously, the state-of-the-art performance on Multi-PIE was achieved by cCNN forests [23], which embed binary decision trees into a CNN architecture. That architecture is applicable only to moderate-scale networks, as the network size grows exponentially with the number of splits. In contrast, the CDJ enables building large-scale decision networks by embedding computationally efficient decision jungles into CNNs.
We observed that simply training a (large-scale) AlexNet already outperforms cCNN forests (Fig. 4). Our CDJ built upon AlexNet is on par with the original AlexNet when only the class labels are used to build the entropy energy. However, our network offers the flexibility to incorporate any auxiliary labels, leading to significantly improved accuracy over AlexNet when auxiliary pose information is exploited. This demonstrates the utility of the CDJ as a flexible alternative to cCNN forests [23].
5 Conclusion
We propose deep convolutional decision jungles (CDJs), which share the benefits of both decision jungles and CNNs: class-wise purity at each node and end-to-end feature learning/classification. Compared to existing combinations of decision graphs and CNNs (e.g., cCNN forests [23]) that already outperform CNNs, our model offers higher flexibility in routing data and enables us to exploit large-scale CNN architectures. This is facilitated by our new decision jungle architecture, which provides multiple soft routing possibilities. The CDJ also offers a systematic way of exploiting auxiliary information when available. Applied to three image classification problems and a face verification problem, our algorithm demonstrates improved accuracy over state-of-the-art methods.
References
 (1) E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup. Conditional computation in neural networks for faster models. In Proc. ICLR Workshop Track, 2016.
 (2) S. R. Bulò and P. Kontschieder. Neural decision forests for semantic image labeling. In Proc. CVPR, 2014.
 (3) L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 2006.
 (4) I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In Proc. ICML, 2013.
 (5) R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 2010.
 (6) Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. In Proc. CVPR, 2017.
 (7) Y. Ioannou, D. Robertson, D. Zikic, P. Kontschieder, J. Shotton, M. Brown, and A. Criminisi. Decision forests, convolutional networks and the models in-between. arXiv:1603.01250, 2016.
 (8) Z. Jiang, Y. Wang, L. Davis, W. Andrews, and V. Rozgic. Learning discriminative features via label consistent neural network. In Proc. WACV, 2017.
 (9) P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò. Deep neural decision forests. In Proc. ICCV, 2015.
 (10) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 (11) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
 (12) C.-Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Proc. AISTATS, 2015.
 (13) M. Lin, Q. Chen, and S. Yan. Network in network. In Proc. ICLR, 2014.
 (14) V. N. Murthy, V. Singh, T. Chen, R. Manmatha, and D. Comaniciu. Deep decision network for multi-class image classification. In Proc. CVPR, 2016.
 (15) O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In Proc. CVPR, 2012.
 (16) O. Rippel, M. Paluri, P. Dollár, and L. Bourdev. Metric learning with adaptive density discrimination. In Proc. ICLR, 2016.
 (17) N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proc. ICLR, 2017.
 (18) J. Shotton, T. Sharp, P. Kohli, S. Nowozin, J. Winn, and A. Criminisi. Decision jungles: Compact and rich models for classification. In Proc. NIPS, 2013.
 (19) K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In Proc. BMVC, 2013.
 (20) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
 (21) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. CVPR, 2015.
 (22) Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In Proc. ECCV, 2016.
 (23) C. Xiong, X. Zhao, D. Tang, K. Jayashree, S. Yan, and T.-K. Kim. Conditional convolutional neural network for modality-aware face recognition. In Proc. ICCV, 2015.
 (24) J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In Proc. CVPR, 2016.
 (25) M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In Proc. ICLR, 2013.
 (26) M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, 2014.
 (27) Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity-preserving face space. In Proc. ICCV, 2013.