# Deep Convolutional Decision Jungle for Image Classification

## Abstract

We propose a novel method called the deep convolutional decision jungle (CDJ), together with its learning algorithm, for image classification. The CDJ maintains the structure of standard convolutional neural networks (CNNs), i.e., multiple layers of multiple response maps that are fully connected. Each response map (or *node*) in both the convolutional and fully-connected layers selectively responds to class labels, such that each data sample travels via a specific *soft* route through the activated nodes. The CDJ learns features automatically, whereas decision forests and jungles require pre-defined feature sets. Compared to CNNs, the method embeds the benefits of data-dependent discriminative functions, which better handle multi-modal/heterogeneous data; further, the method offers more diverse sparse network responses, which in turn can be used for cost-effective learning/classification. The network is learnt by combining the conventional softmax loss and the proposed entropy loss in each layer. The entropy loss, as used in decision tree growing, measures the purity of data activation according to the class label distribution. The back-propagation rule for the proposed loss function is derived from stochastic gradient descent (SGD) optimization of CNNs. We show that our proposed method outperforms state-of-the-art methods on three public image classification benchmarks and one face verification dataset. We also demonstrate the use of auxiliary data labels, when available, which helps our method learn more discriminative routing and representations and leads to improved classification.

## 1 Introduction

Random forests (RFs) have been widely used as ensemble classifiers for image classification. Whereas an RF is a feature selection method, convolutional neural networks (CNNs) have proven powerful at feature learning. Recently, several works [2] have attempted to combine the two worlds (e.g., see Figure 1a) in order to incorporate hierarchical tree structures [2], handle multimodal data [23], build modular networks by clustering and CNNs [24], and/or accelerate computation [7]. Also relevant to this study is work that encourages sparsity in the representation for regularization and memory/time efficiency [17]. These works reported improved accuracy; however, there is room for improvement, especially in the following aspects:

In the adopted binary tree structure [2], once a data sample is sent down a wrong path it cannot recover, which leads to overfitting. Soft partitioning [9] relieves the issue to a certain degree; however, its structure grows exponentially through recursive binary splits [23]. Decision jungle [18] type algorithms are a natural extension and have shown good generalization ability.

Most existing methods for combining trees and CNNs [23] require additional model parameters. The split is often performed by another deep routing network, and this happens recursively down the tree structure, so the number of parameters explodes. It is not straightforward to apply these methods to existing large/deep networks [11].

In this work, we propose a novel method that applies the concept of class entropy or purity in RFs to existing CNN structures (Figure 1b). The proposed method helps learn more discriminative and robust features from early layers. The architecture offers the following benefits:

Class-wise purity in early layers: Our method learns a convolutional neural network (CNN) with an entropy loss per layer. The proposed loss helps purify the response maps, which we will call ‘nodes’, in all intermediate layers of the CNN. The response maps in each layer are switched on and off conditionally on the input vectors, such that each response map is dedicated to certain classes (ideally a single class) rather than to all of them. Note that the response maps for each data point take continuous rather than discretized values, so this can be treated as a kind of ‘soft’ routing.

Decision jungle structure: Unlike in binary decision trees, in the fully connected decision jungle structure a sample’s path is recoverable in later layers. The decision jungle has been shown to be more robust than binary trees owing to its improved generalization ability. Such a structure also makes the method more flexible (applicable to any existing CNN without altering its architecture) and more memory efficient than binary tree structures [18].

Minimum number of additional parameters: The proposed architecture learns routing directly through the activation/de-activation of the response maps, which keeps the number of model parameters as low as in the original CNN models. Existing methods [24,7] require routing networks in addition to CNNs, and the number of routing networks increases exponentially in a binary tree. The proposed method adds only a few parameters, e.g., the balancing parameter (see Section 3.1).

Encoding of auxiliary information: Intermediate layers can be further purified by auxiliary labels in addition to the class labels used for the softmax loss. Experiments show it leads to significant accuracy improvements.

## 2 Related work

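Table 1: Categorization of methods that combine CNNs and tree structures.

| Method | Routing: Split | Routing: Conv | Routing: FC | Routing: w/o Parameters | Big Architecture | Additional Information |
|---|---|---|---|---|---|---|
| | Binary | | | | | |
| | Binary | | | | | |
| | Binary | | | | | |
| | Soft multi | | | | | |
| Ours | Soft multi | | | | | |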

**Combination of CNNs and tree structures.** The objective of combining trees and CNNs in [2] is to jointly learn feature representations of the input data and a tree-structured classifier conditioned on the input data. Previous methods can be categorized by several attributes, as in Table 1. The tree structures in [2] are embedded in the fully connected (FC) layers [9], rather than the convolutional (Conv) layers, or in multi-layer perceptrons [2], which provide the final prediction as a classifier. In contrast, the tree structures in [23] are embedded in the convolutional layers. They learn hidden modalities of the data (e.g., face poses in [23] and super-classes in [7]) from the early layers. The work in [7] uses continuous rather than discrete weights, so the method can represent multiple ‘soft’ routes rather than the ‘hard’ binary splits in [2]. However, that work demonstrated routing with only 2-3 splits. A major difficulty in applying [7] to a fully connected network such as a decision jungle [18], where a node routes to all the nodes in the next layer, is the need for additional routing network parameters, whose number grows with the number of routes. Furthermore, the works in [23] use small network architectures, while the methods in [9] were applied to recently proposed CNN architectures (e.g., AlexNet [11], VGG-16 Net [20], and GoogleNet [21]). Finally, the concept of tree-structured or conditional activation has been used to improve CNN efficiency [6].

Compared to previous works [2], our method requires no routing parameters, while enabling multiple soft routes and applying to existing big architectures [7]. Experiments on three image classification benchmarks and one face verification benchmark demonstrate improved accuracy (Section 4). When additional label information is used, the method further improves accuracy via a more explicit network routing based on those labels.

**Combination of CNNs and clustering loss.** Though it does not explicitly use tree structures, a relevant group of works learns CNNs while performing hierarchical clustering [24]. The method in [24] iteratively performs feature representation learning and data clustering to better cluster data samples; the method in [14] introduces a clustering loss to help the classification task; and the concept of a mixture of experts is introduced into the LSTM architecture in [17] to improve model capacity for language modeling. Set-based loss functions [22] have also been proposed to optimize the intra-class/inter-class data variation in addition to the softmax function. Compared to these existing methods, our approach captures data separation from the early layers using supervised discrete class labels rather than exploiting sample distances in an unsupervised way.

## 3 Deep convolutional decision jungle architecture

The proposed deep convolutional decision jungle (CDJ) adopts the traditional convolutional neural network (CNN) architecture. Given a fixed network topology (e.g., the number and sizes of layers), the operation of a CNN is prescribed by a set of weight vectors (or convolution filters) $W = \{w^l_{ij}\}$:^{1} each filter $w^l_{ij}$ is applied to the $i$-th (*response*) map $h^{l-1}_i$ of the $(l-1)$-th layer to generate the $l$-th layer’s $j$-th response map $h^l_j$:

$$h^l_j = \sigma\left(\sum_{i=1}^{M_{l-1}} w^l_{ij} * h^{l-1}_i\right),$$

where $M_l$ is the number of response maps in the $l$-th layer, $*$ denotes the convolution operation, and $\sigma$ represents the non-linear activation and max-pooling operations. We use rectified linear units (ReLUs), but other differentiable activation functions can also be used.

The CDJ is instantiated by interpreting the response maps and convolution filters, respectively, as decision nodes and edges joining pairs of nodes in a decision jungle (Figure 1b). In conventional decision trees and jungles, an input data point $x$ at a node is exclusively *routed* to a single child node. However, in our CDJ, a *soft decision* is made at each node $h^{l-1}_i$: if the filter $w^l_{ij}$ produces a non-zero response map $h^l_j$, $x$ is interpreted as being softly routed to node $h^l_j$.
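The soft-routing interpretation can be illustrated with a minimal NumPy sketch. It assumes filters of size 1 (the fully-connected case of footnote 1), so that the layer reduces to a matrix product followed by ReLU; the function names and the toy weights are hypothetical and not taken from the paper.

```python
import numpy as np

def layer_responses(h_prev, weights):
    """ReLU responses of one layer; size-1 filters reduce to a matrix product."""
    return np.maximum(h_prev @ weights, 0.0)

def soft_routes(h):
    """Indices of nodes with non-zero response, i.e. nodes the sample is softly routed to."""
    return np.flatnonzero(h > 0)

# Toy example: 2 nodes in the previous layer, 3 nodes in the current layer.
h_prev = np.array([1.0, 2.0])
weights = np.array([[1.0, -1.0, 0.5],
                    [0.5,  0.2, 0.5]])   # weights[i, j] joins node i to node j

h = layer_responses(h_prev, weights)     # pre-activations are [2.0, -0.6, 1.5]
routes = soft_routes(h)                  # nodes 0 and 2 are active
```

Unlike a tree split, more than one child node can be active for the same sample, and the response values act as continuous routing weights.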

### 3.1 Training deep convolutional decision jungles

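Suppose we are given a set of training data points $X = \{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^{D}$, $D$ being the dimensionality of the input space. For a $C$-class classification problem, $y_i \in \{1, \ldots, C\}$. Training a CDJ corresponds to identifying the optimal set of convolutional filter parameters $W$, which we achieve by minimizing an energy functional that combines the standard training error with a new *entropy energy*:

$$\mathcal{E}(W) = \mathcal{L}_S(W) + \lambda \sum_{l=1}^{L} \mathcal{E}^l_H(W), \tag{1}$$

where $\lambda$ controls the contributions of the training error and entropy energy terms and $L$ is the number of layers in the CDJ. We use the softmax loss for the training cost functional $\mathcal{L}_S$:

$$\mathcal{L}_S(W) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(f_{y_i}(x_i)\right)}{\sum_{c=1}^{C}\exp\left(f_c(x_i)\right)}, \tag{2}$$

where $f_c(x_i)$ denotes the network output for class $c$.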

With the goal of embedding a decision jungle into a CNN architecture, we design our entropy energy $\mathcal{E}^l_H$ to measure the *quality of routing* performed on the dataset $X$. One way of constructing such a regularizer is to adapt the class entropy-based training criteria used in decision forests and jungles [18] into a differentiable cost functional.

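Our entropy measure is defined based on the empirical class distributions at layer $l$:

$$P^l(j, c) = \frac{1}{Z^l} \sum_{i=1}^{N} \delta(y_i, c)\, r^l_j(x_i), \tag{3}$$

where

$$Z^l = \sum_{i=1}^{N} \sum_{j=1}^{M_l} r^l_j(x_i),$$

with $\delta(a, b) = 1$ if $a = b$ and $0$ otherwise. The scalar response $r^l_j(x_i)$ is obtained by applying the non-linear activation (ReLU) followed by average pooling to the filter output $\sum_{k} w^l_{kj} * h^{l-1}_k$ evaluated at $x_i$. The joint distribution $P^l$ is a two-dimensional array storing the relative frequencies of patterns activated at each pair of response and class indices. This definition is consistent with our interpretation of CNN response maps as nodes of a decision jungle. The entropy of the $l$-th layer is then defined as

$$\mathcal{E}^l_H = -\sum_{j=1}^{M_l} P^l(j) \sum_{c=1}^{C} P^l(c \mid j) \log P^l(c \mid j), \qquad P^l(j) = \sum_{c=1}^{C} P^l(j, c), \quad P^l(c \mid j) = \frac{P^l(j, c)}{P^l(j)}. \tag{4}$$

This expression is differentiable with respect to $W$ and facilitates gradient descent-type optimization, e.g., stochastic gradient descent (SGD).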
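As a rough illustration of how such a layer entropy can be evaluated, the following NumPy sketch builds the joint response/class distribution from scalar responses and labels. It assumes the node-weighted class entropy familiar from decision forests; the exact functional form, the array layout (rows are nodes, columns are classes), and all function names are assumptions made for this example.

```python
import numpy as np

def joint_distribution(responses, labels, num_classes):
    """responses: (N, M) non-negative scalar responses; labels: (N,) class ids.
    Returns an (M, C) array of relative activation frequencies."""
    onehot = np.eye(num_classes)[labels]     # (N, C) indicator delta(y_i, c)
    counts = responses.T @ onehot            # (M, C): sum_i delta(y_i, c) * r_j(x_i)
    return counts / counts.sum()

def layer_entropy(joint, eps=1e-12):
    """Assumed form: -sum_j P(j) sum_c P(c|j) log P(c|j)."""
    node_prob = joint.sum(axis=1, keepdims=True)    # P(j)
    cond = joint / np.maximum(node_prob, eps)       # P(c|j)
    return float(-(joint * np.log(np.maximum(cond, eps))).sum())

labels = np.array([0, 1])
pure_responses = np.array([[1.0, 0.0],     # sample 0 only activates node 0
                           [0.0, 1.0]])    # sample 1 only activates node 1
mixed_responses = np.ones((2, 2))          # every node fires for every class

pure = layer_entropy(joint_distribution(pure_responses, labels, 2))
mixed = layer_entropy(joint_distribution(mixed_responses, labels, 2))
```

When each node responds to a single class the entropy vanishes, whereas nodes that fire equally for both classes give the maximum value of log 2.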

**Stratified entropy loss.** The entropy energy is highly non-linear with multiple global (as well as local) minima, posing a significant challenge in the optimization of the overall energy $\mathcal{E}$. In our preliminary experiments, we observed that the new entropy energy (with $\lambda > 0$ in $\mathcal{E}$; Equation 1) almost always degrades the performance, as it tends to generate *degenerate* probability distributions: the optimized solutions *disable* some classes and nodes by allocating (near-)zero probabilities to selected rows and columns (corresponding to specific classes and nodes, respectively). This reduces the overall entropy, but it leads to poor generalization. Figure 2 illustrates this problem with an example.
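The degeneracy can be reproduced on a toy joint distribution. The sketch below assumes a node-weighted class entropy and a layout with rows as nodes and columns as classes; both are assumptions for illustration, as are the names used.

```python
import numpy as np

def class_entropy(joint, eps=1e-12):
    """Node-weighted class entropy of an (M, C) joint distribution (assumed form)."""
    node_prob = joint.sum(axis=1, keepdims=True)
    cond = joint / np.maximum(node_prob, eps)
    return float(-(joint * np.log(np.maximum(cond, eps))).sum())

# Rows are nodes, columns are classes (an assumed layout).
mixed = np.array([[0.25, 0.25],
                  [0.25, 0.25]])
degenerate = np.array([[0.5, 0.0],
                       [0.5, 0.0]])   # class 1 is never routed to any node

entropy_mixed = class_entropy(mixed)
entropy_degenerate = class_entropy(degenerate)

# Classes whose total probability mass is zero have been "disabled".
disabled_classes = np.flatnonzero(degenerate.sum(axis=0) == 0)
```

The degenerate distribution attains a lower entropy than the mixed one even though one class receives no activation at all, which is the failure mode that the guided optimization is designed to avoid.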

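Our strategy is to explicitly control the optimization trajectory of $W$ by setting up intermediate goal states. To facilitate the exposition of our optimization framework, we first define a two-dimensional matrix $Q^l$ that represents the unnormalized joint class distribution (see Equation 3) for each layer:

$$Q^l(j, c) = \sum_{i=1}^{N} \delta(y_i, c)\, r^l_j(x_i), \tag{5}$$

with the indicator $\delta$ and the scalar responses $r^l_j$ as in Equation 3. This variable encapsulates the behavior of $W$ exercised in the class probability distribution. The proposed CDJ then iteratively identifies the solution $W$ (equivalently $Q^l$) by solving a new (sub-)optimization problem per iteration. Our guided optimization problem minimizes

$$\tilde{\mathcal{E}}(W) = \mathcal{L}_S(W) + \sum_{l=1}^{L} \left\| Q^l - G^l \right\|_F^2, \tag{6}$$

where $\mathcal{L}_S$ is the softmax loss and $\|\cdot\|_F$ is the Frobenius norm.

At each iteration, we first determine the *guide variables* $G^l$ and then optimize the (unnormalized) probability map $Q^l$ by penalizing its deviation from $G^l$ plus the softmax loss. The guide variable is designed to explicitly avoid the degenerate cases while reducing the entropy.^{2} The gradient of $\tilde{\mathcal{E}}$ with respect to a filter $w$ follows from the chain rule:

$$\frac{\partial \tilde{\mathcal{E}}}{\partial w} = \frac{\partial \mathcal{L}_S}{\partial w} + 2\sum_{l=1}^{L}\sum_{j=1}^{M_l}\sum_{c=1}^{C}\left(Q^l(j,c) - G^l(j,c)\right)\frac{\partial Q^l(j,c)}{\partial w}, \tag{7}$$

with (see Equation 5)

$$\frac{\partial Q^l(j,c)}{\partial w} = \sum_{i=1}^{N}\delta(y_i,c)\,\frac{\partial r^l_j(x_i)}{\partial w}.$$

$\partial \mathcal{L}_S / \partial w$ and $\partial r^l_j(x_i) / \partial w$ are standard terms appearing in CNNs and can be calculated straightforwardly.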
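The deviation penalty and its gradient can be sketched in NumPy for a single layer. The example assumes a squared Frobenius penalty between the unnormalized joint distribution and a given guide matrix, and it differentiates only with respect to the scalar responses; propagating further to the filters is left to the usual back-propagation. The names and the random toy data are hypothetical.

```python
import numpy as np

def unnormalized_joint(responses, onehot):
    """(M, C) matrix with entries sum_i delta(y_i, c) * r_j(x_i)."""
    return responses.T @ onehot

def guide_penalty(responses, onehot, guide):
    """Squared Frobenius deviation of the joint matrix from the guide (assumed form)."""
    diff = unnormalized_joint(responses, onehot) - guide
    return float((diff ** 2).sum())

def guide_penalty_grad(responses, onehot, guide):
    """Gradient of the penalty with respect to the (N, M) scalar responses."""
    diff = unnormalized_joint(responses, onehot) - guide
    return 2.0 * onehot @ diff.T

rng = np.random.default_rng(0)
responses = rng.random((6, 4))                 # 6 samples, 4 nodes
labels = np.array([0, 1, 2, 0, 1, 2])
onehot = np.eye(3)[labels]                     # 3 classes
guide = rng.random((4, 3))                     # stand-in guide matrix

grad = guide_penalty_grad(responses, onehot, guide)
```

Because the penalty is quadratic in the responses, the analytic gradient can be checked against central finite differences, which is a convenient sanity test before wiring the term into a training loop.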

**Construction of guide variable $G^l$.** Our guide variable $G^l$ corresponds to an entropy-reduced version of the unnormalized joint class distribution $Q^l$ (Equation 8), in which two variables $A^l$ and $B^l$ determine the target update directions and magnitudes, respectively. These two variables encode a set of constraints that help avoid the degeneracy of $Q^l$ (and equivalently of $P^l$) while reducing the overall entropy. First, we enforce that only one entry of the binary variable $A^l$ takes value 1 per row (Equation 9). This, combined with the selection of entries discussed shortly, ensures that the resulting entropy of $G^l$ is smaller than that of the original $Q^l$. Second, we enforce that the column sums of $A^l$ are roughly balanced, explicitly preventing the allocation of zero probabilities to any class (Figure 2b; Equation 10). The third condition retains the total *mass* of each column of $Q^l$, preventing any filters from converging towards zero (Figure 2c; Equation 11).

Since $A^l$ is a binary variable, Equation 11 and Equation 8 uniquely determine $B^l$, in which a parameter $\eta$ takes the role of the balancing parameter $\lambda$ in $\mathcal{E}$ (cf. Equation 1); we kept $\eta$ fixed throughout all experiments. The effect of varying $\eta$ is examined in Figure 3 (c).

Finally, $A^l$ is decided by assigning 1 to the entries corresponding to large values of $Q^l$: first, we sort the entries of $Q^l$ in decreasing order. The entries of $A^l$ are then visited in this order, and the corresponding entry value is set to 1 if the conditions in Equations 9 and 10 are satisfied, and to 0 otherwise.
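The greedy selection of the binary direction variable can be sketched as follows. The sketch assumes that rows index nodes and columns index classes, and it formalizes "roughly balanced" as a per-class cap of ceil(M / C) selected nodes; both choices, as well as the function name, are assumptions for illustration. The update magnitudes are not modeled here.

```python
import numpy as np

def greedy_assignment(q):
    """Binary (M, C) matrix with one 1 per row and at most ceil(M / C) ones per column,
    filled by visiting the entries of q from largest to smallest."""
    num_nodes, num_classes = q.shape
    cap = -(-num_nodes // num_classes)               # ceil(M / C), assumed balance rule
    a = np.zeros(q.shape, dtype=int)
    row_used = np.zeros(num_nodes, dtype=bool)
    col_count = np.zeros(num_classes, dtype=int)
    order = np.argsort(-q, axis=None, kind="stable") # flat indices, decreasing values
    for flat in order:
        j, c = np.unravel_index(flat, q.shape)
        if not row_used[j] and col_count[c] < cap:   # row and column conditions
            a[j, c] = 1
            row_used[j] = True
            col_count[c] += 1
    return a

q = np.array([[9.0, 1.0],
              [8.0, 2.0],
              [7.0, 3.0],
              [6.0, 5.0]])
a = greedy_assignment(q)
```

In this toy case all four nodes respond most strongly to class 0, but the cap forces two of them to be assigned to class 1, so no class is left without a dedicated node. When the number of nodes is a multiple of the number of classes the cap yields exactly balanced columns; otherwise it only bounds the column sums from above.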

We adopt mini-batch optimization instead of batch-optimizing the cost functional (Equation 6). This facilitates efficient training on a GPU. Algorithm ? summarizes the CDJ training process.

**Training with auxiliary information.** Our new entropy loss can be defined for any variables that replace the class labels $y$. This enables a systematic way of exploiting auxiliary information when available. We demonstrate the effectiveness of this approach in our face verification experiment using auxiliary pose labels (see Table ?).

## 4 Experiments

**Setup.** We evaluate our convolutional decision jungle (CDJ) on three standard image classification datasets, *Oxford-IIIT Pet* [15], *CIFAR-100* [10], and *Caltech-101* [3], and on one face verification dataset, *Multi-PIE* [5]. The *Oxford-IIIT Pet* dataset consists of 7,349 images covering 37 different breeds of cats and dogs (e.g., *Shiba Inu* and *Yorkshire Terrier*). We adopt the training and test split of 3,680 and 3,669 images, respectively [15]. The *CIFAR-100* dataset contains 60,000 natural images of 100 classes that are categorized into 20 superclasses. We use a split of 50,000 training and 10,000 testing images [25]. The *Caltech-101* dataset contains 9,146 images of 101 object categories. 30 images are chosen from each category for training, and the remaining images (up to a fixed maximum per category) are used for testing [8].

The *Multi-PIE* (Session 1 subset) face verification dataset consists of photographs of 250 individuals taken at 20 different illumination levels and 15 poses ranging from $-90^\circ$ to $+90^\circ$. We proceed as per [23]: for training, we use all images (15 poses, 20 illumination levels) of the first 150 individuals, while for testing we use one frontal view with neutral illumination (ID07 entries) as the gallery image for each of the remaining 100 subjects. The rest of the images are used as probes. We use the responses of the penultimate layer of the trained networks as features for cosine distance-based matching.
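The gallery/probe matching step can be sketched in NumPy. The example assumes that each probe is assigned to the gallery entry with the highest cosine similarity (equivalently, the smallest cosine distance); the function name and the two-dimensional toy features are hypothetical stand-ins for penultimate-layer responses.

```python
import numpy as np

def cosine_match(probes, gallery):
    """Index of the most similar gallery feature for every probe feature."""
    p = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    similarity = p @ g.T                 # (num_probes, num_gallery) cosine similarities
    return similarity.argmax(axis=1)

gallery = np.array([[1.0, 0.0],          # one feature vector per gallery subject
                    [0.0, 1.0]])
probes = np.array([[2.0, 0.1],
                   [0.2, 3.0],
                   [5.0, 4.0]])
matches = cosine_match(probes, gallery)
```

Normalizing both sides makes the match independent of feature magnitude, so only the direction of the learned representation matters.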

For each dataset, we initialize our algorithm with a pre-trained standard CNN that achieves state-of-the-art performance: AlexNet [11] and VGG-16 [20] for *Oxford-IIIT Pet* and *Caltech-101*, respectively, and NiN [13] for *CIFAR-100*. For *Multi-PIE*, previous state-of-the-art results were achieved by c-CNN Forests [23], which embed a decision tree structure into a *small* CNN (Section 2). Empirically, the large-scale AlexNet achieves higher accuracy than [23], and so we adopt AlexNet as our baseline.

The training parameters, including the network topology, mini-batch sizes, number of training epochs, and dropout and local response normalization decisions, are adopted from these individual networks: mini-batch sizes are 256, 32, and 100 for AlexNet, VGG-16, and NiN, respectively. The number of epochs is 60 for AlexNet and VGG-16, and 100 for NiN. The learning rate is decreased at every epoch for AlexNet and VGG-16. Our learning rate for the NiN architecture is overall slightly smaller than in [13]: it is held fixed until the 80th epoch and reduced after the 80th and 90th epochs. This small learning rate ensures that the energy functional decreases steadily. For our new entropy energy, we use larger mini-batch sizes, set according to the number of classes $C$, to ensure that the class distributions are balanced within a batch.
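One simple way to obtain class-balanced mini-batches is stratified sampling, sketched below in NumPy. This is an assumption for illustration: the text only states that larger batches are used so that class distributions are balanced, and the sampler, its name, and its `per_class` parameter are hypothetical.

```python
import numpy as np

def balanced_batch(labels, per_class, rng):
    """Indices of a mini-batch containing exactly per_class samples of every class.
    Assumes every class has at least per_class samples."""
    classes = np.unique(labels)
    picks = [rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
             for c in classes]
    return np.concatenate(picks)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)     # 3 classes, 10 samples each
batch = balanced_batch(labels, per_class=4, rng=rng)
counts = np.bincount(labels[batch], minlength=3)
```

With such a sampler the batch size is a multiple of the number of classes, and every class contributes equally to the per-batch estimate of the class distributions.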

The class balancing constraint (Equation 9) causes our entropy energy to apply only to the layers whose sizes ($M_l$) are greater than or equal to the number of classes ($C$): we apply the entropy energy from the second convolutional layer onwards for AlexNet and VGG-16, and from the *cccp3* layer onwards for NiN. In this case, we regard the earlier layers, which do not use the entropy loss, as feature extractors for a decision jungle.

**Results.** Table ? compares the results of our method to eleven state-of-the-art methods across three image classification problems. Overall, our algorithm significantly improves upon all baselines, which demonstrates the effectiveness of our new entropy energy. Our method consistently outperforms state-of-the-art methods including DSN [12] and DDN [14]. In particular, our algorithm improves upon existing algorithms that augment the standard softmax loss functions similarly to ours, including the set-based *magnet loss* [16] (which further adopts data augmentation to improve the performance) and *clustering loss* [14]. Our network attains routing from supervised class information rather than from unsupervised sample distances, as in these previous works, which helps learn more discriminative features.

Figure 3 shows the effect of varying hyper-parameters: (a) and (b) show the testing accuracy with respect to the epoch and the balancing parameter, respectively; (c) shows the average entropy of each layer in CNNs and CDJs. For both networks, the entropy becomes lower toward the last layers. Minimizing the standard softmax loss in a CNN also tends to improve the class purity around the last layer. However, by explicitly enforcing it, our algorithm achieves consistently lower entropy. The strong correlation between the testing accuracy (a) and the average entropy of each layer (c) suggests that our new entropy energy contributes to improving generalization performance.

Table ? shows the results of different algorithms on the *Multi-PIE* face verification dataset. Previously, the state-of-the-art performance on the *Multi-PIE* dataset was achieved by c-CNN forests [23], which embed binary decision trees into a CNN architecture. This architecture is applicable only to moderate-scale networks, as the network size increases exponentially with the number of splits. In contrast, CDJ enables building large-scale decision networks by embedding computationally efficient decision jungles into CNNs.

We observed that simply training a (large-scale) AlexNet already outperforms c-CNN forests (Fig. ?). Our CDJ, building upon AlexNet, is on par with the original AlexNet when the class labels are used to build the entropy energy. However, our network offers the flexibility to incorporate any auxiliary labels, leading to significantly improved accuracy over AlexNet when auxiliary pose information is exploited. This demonstrates the utility of CDJ as a flexible alternative to c-CNN forests [23].

## 5 Conclusion

We propose deep convolutional decision jungles (CDJs), which share the benefits of both decision jungles and CNNs: class-wise purity at each node and end-to-end feature learning/classification. Compared to existing combinations of decision graphs and CNNs (e.g., c-CNN Forests [23]) that already outperform CNNs, our model offers higher flexibility in routing data and enables us to exploit large-scale CNN architectures. This is facilitated by our new decision jungle architecture, which provides multiple soft routing possibilities. Our CDJ also offers a systematic way of exploiting auxiliary information when available. Applied to three image classification problems and a face verification problem, our algorithm demonstrates improved accuracy over state-of-the-art methods.

### Footnotes

- The weight vectors of the fully-connected layers are regarded as convolution filters of size 1.
- More precisely, it should be regarded as the inverse of class *purity*, as the guide variable is not a probability distribution.

### References

1. **Conditional computation in neural networks for faster models.** E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup. In *Proc. ICLR workshop track*, 2016.
2. **Neural decision forests for semantic image labeling.** S. R. Bulò and P. Kontschieder. In *Proc. CVPR*, 2014.
3. **One-Shot learning of object categories.** L. Fei-Fei, R. Fergus, and P. Perona. *TPAMI*, 2006.
4. **Maxout networks.** I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. In *Proc. ICML*, 2013.
5. **Multi-pie.** R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. *Image and Vision Computing*, 2010.
6. **Deep roots: Improving CNN efficiency with hierarchical filter groups.** Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. In *Proc. CVPR*, 2017.
7. **Decision forests, convolutional networks and the models in-between.** Y. Ioannou, D. Robertson, D. Zikic, P. Kontschieder, J. Shotton, M. Brown, and A. Criminisi. In *ArXiv 1603.01250*, 2016.
8. **Learning discriminative features via label consistent neural network.** Z. Jiang, Y. Wang, L. Davis, W. Andrews, and V. Rozgic. In *Proc. WACV*, 2017.
9. **Deep neural decision forests.** P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò. In *Proc. ICCV*, 2015.
10. **Learning multiple layers of features from tiny images.** A. Krizhevsky. *Technical report, University of Toronto*, 2009.
11. **ImageNet classification with deep convolutional neural networks.** A. Krizhevsky, I. Sutskever, and G. E. Hinton. In *Proc. NIPS*, 2012.
12. **Deeply-supervised nets.** C.-Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. In *Proc. AISTATS*, 2015.
13. **Network in network.** M. Lin, Q. Chen, and S. Yan. In *Proc. ICLR*, 2014.
14. **Deep decision network for multi-class image classification.** V. N. Murthy, V. Singh, T. Chen, R. Manmatha, and D. Comaniciu. In *Proc. CVPR*, 2016.
15. **Cats and dogs.** O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. In *Proc. CVPR*, 2012.
16. **Metric learning with adaptive density discrimination.** O. Rippel, M. Paluri, P. Dollár, and L. Bourdev. In *Proc. ICLR*, 2016.
17. **Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.** N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. In *Proc. ICLR*, 2017.
18. **Decision jungles: compact and rich models for classification.** J. Shotton, T. Sharp, P. Kohli, S. Nowozin, J. Winn, and A. Criminisi. In *Proc. NIPS*, 2013.
19. **Fisher vector faces in the wild.** K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. In *Proc. BMVC*, 2013.
20. **Very deep convolutional networks for large-scale image recognition.** K. Simonyan and A. Zisserman. In *Proc. ICLR*, 2015.
21. **Going deeper with convolutions.** C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. In *Proc. CVPR*, 2015.
22. **A discriminative feature learning approach for deep face recognition.** Y. Wen, K. Zhang, Z. Li, and Y. Qiao. In *Proc. ECCV*, 2016.
23. **Conditional convolutional neural network for modality-aware face recognition.** C. Xiong, X. Zhao, D. Tang, K. Jayashree, S. Yan, and T.-K. Kim. In *Proc. ICCV*, 2015.
24. **Joint unsupervised learning of deep representations and image clusters.** J. Yang, D. Parikh, and D. Batra. In *Proc. CVPR*, 2016.
25. **Stochastic pooling for regularization of deep convolutional neural networks.** M. D. Zeiler and R. Fergus. In *Proc. ICLR*, 2013.
26. **Visualizing and understanding convolutional networks.** M. D. Zeiler and R. Fergus. In *Proc. ECCV*, 2014.
27. **Deep learning identity-preserving face space.** Z. Zhu, P. Luo, X. Wang, and X. Tang. In *Proc. ICCV*, 2013.