Training decision trees as replacement for convolution layers
Abstract
We present an alternative to the convolution layers in convolutional neural networks (CNNs). Our approach reduces the complexity of convolutions by replacing them with binary decisions. These binary decisions are used as indices into conditional probability distributions, where each probability represents a leaf in a decision tree. The indices into the probabilities therefore need to be determined only once, which reduces the complexity of convolutions by the depth of the output tensor. Index computation is performed by simple binary decisions that require fewer CPU cycles than the multiplications conventionally used. We also show how the binary decisions and their conditional probability distributions replace 2D weight matrices as well as 3D weight tensors. These new layers can be trained like convolution layers in CNNs using the backpropagation algorithm, for which we provide a formalization. Our results on multiple publicly available data sets show that our approach outperforms conventional CNNs. Beyond the formalized reduction of complexity and the improved accuracy, we show empirically a significant runtime improvement compared to convolution layers.
Wolfgang Fuhl, Eberhard Karls University Tübingen, wolfgang.fuhl@uni-tuebingen.de
Gjergji Kasneci, Eberhard Karls University Tübingen, gjergji.kasneci@uni-tuebingen.de
Wolfgang Rosenstiel, Eberhard Karls University Tübingen, Wolfgang.Rosenstiel@uni-tuebingen.de
Enkelejda Kasneci, Eberhard Karls University Tübingen, enkelejda.kasneci@uni-tuebingen.de
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction and Related Work
Conditioning CNNs is a modern approach to reducing runtime. It is typically achieved by activating only parts of a model or by scaling model complexity ShazeerMMDLHD17; ioannou2016decision; chen2018neural, with the goal of reducing computational cost without compromising accuracy. Recent approaches even reduce the complexity of the convolution layers themselves keskin2018splinenets without affecting accuracy. This paper describes a new approach for the practical implementation of conditional neural networks using conditional distributions and binary decisions. Similar to keskin2018splinenets, we replace convolutional layers to reduce computational complexity, and we additionally compute the indices by simple binary decisions. We show analytically and empirically, on public data sets, that the runtime is reduced while the accuracy of the model is retained or increased.
There are four main categories of Conditional Neural Networks:
1. Neural networks that use loss functions for optimizing decision parameters.
2. Probabilistic approaches that learn a selection of experts.
3. Neural networks with decision tree architectures.
4. Replacement layers for the convolutions, which map hierarchical decision graphs conditionally to the input feature space.
The first category uses non-differentiable decision functions whose parameters are learned by an additional loss function. A loss function that maximizes the distances between subclusters was presented in xiong2015conditional. The path loss function used in baek2017deep is based on the purity of the data activation with respect to its class label distribution. In bicici2018conditional, the information gain is used to learn an evaluation function that activates paths through the network.
In the second category, probabilistic approaches are pursued. Weights are assigned to each branch and treated as a sum over a loss function ioannou2016decision. A similar approach is followed in ShazeerMMDLHD17. The main difference is that a very high number of branches per layer is considered and the best k branches are followed in the training phase as well as in the test phase. Another approach trains two neural networks where one provides the decision probability at the output and the second network performs the classification kontschieder2015deep. Both nets are trained jointly.
In the third category, the architecture of the neural network is similar to a decision tree. Randomized multilayer perceptrons are used in rota2014neural as branch nodes and trained together with the entire net. An alternative architecture is presented in denoyer2014deep, where each node in the net has three possible subsequent nodes. The following node is selected via an evaluation function that is learned with the REINFORCE algorithm denoyer2014deep. In wang2017using, partitioning features are learned, which makes it possible to train the whole network with the backpropagation algorithm. The architecture of the network corresponds to that of a binary decision tree: each node represents a split and therefore has exactly two outputs, of which only one can be active at a time wang2017using.
The fourth and last category includes approaches that introduce new layers into a neural network. Spatial transformer networks jaderberg2015spatial learn a transformation of the input tensor that simplifies further processing in the network. In general, this yields a uniform representation of the input tensor, which can be understood as spatial alignment. Since the accuracy of a network depends not only on the input but also on the filters in its convolution layers, jia2016dynamic introduces a layer that learns to generate optimal filters based on the input. This layer consists of a small neural network with convolution and transposed convolution layers. A further possibility for the conditional adaptation of neural networks is to configure the weights over time, as realized in holden2017phase by means of a phase function. The authors of holden2017phase used a Catmull-Rom spline as the phase function, which can also be replaced by a neural network. Additive component analysis murdock2017additive, in contrast, realizes a nonlinear dimensionality reduction by approximating additive functions. It is also defined as a fully connected layer, and several such layers can be stacked and trained together. Building on this, SplineNets keskin2018splinenets assign a new interpolated value from a learned spline based on the response of a learned filter in the previous layer. The spline makes the function differentiable, and several of these layers in sequence can be understood as a topological graph.
Our novel approach is based on the idea of SplineNets keskin2018splinenets to reduce convolutional complexity by simply mapping input features to interpolated values. In addition, we simplify index generation using the general idea of binary neural networks courbariaux2015binaryconnect. For this we use conditional probabilities as in random ferns bosch2007image. The indices are determined by simple greater-than or less-than comparisons between input values. These indices are used to select probability values from several distributions, which are then multiplied by the input values. The indices themselves are the result of evaluating the decision tree, and the selected probability value is the leaf weight. This means that we treat the values in the distributions as well as the input and output values as probabilities. This allows us to train the new layer with the backpropagation algorithm together with the rest of the net, and to connect several such layers in series. As with SplineNets keskin2018splinenets, the reduction in computational complexity comes from the indexing, which has to be computed only once, whereas a convolution layer has to compute a new convolution for each filter. In addition, our layer does not have to learn function parameters or perform expensive multiplications to generate the indices.
Because the conditional probabilities are trained holistically in one layer, our approach belongs to category 4. Since the index generation is based on comparisons, and random ferns bosch2007image are a specialization of random forests breiman2001random, our approach also belongs to category 3. It is therefore a hybrid approach that is formalized as an independent layer but contains decision tree structures.
Our contributions in this work are:
1. A new layer that selects leaf weights based on binary decisions.
2. The approximation of filters for index generation by binary decisions.
3. A differentiable formal definition of the forward execution which is suitable for the backpropagation algorithm.
4. Analytical and empirical evaluation of the quality and runtime improvement compared to CNNs.
2 Methodology
Figure 1 shows the core concept of our approach. Random ferns are binary decisions that are linked to conditional probabilities (see Figure 1). The binary decisions themselves represent the conditions, so a fern is a decision tree. Since every binary decision is always evaluated, the structure of this tree is arbitrary as long as each decision function occurs exactly once on every path, which makes the decision tree a balanced tree.
(1) 
Equation 1 describes the evaluation of such a decision tree, or fern. Its arguments are the input tensor, the probability distribution (see Figure 1), and the indices of the comparisons. To use this decision tree like a convolution, the comparison indices refer only to values in an input window that is moved over the whole input tensor (see Figure 2). To combine several of these decision trees, the probabilities are multiplied. In the case of Equation 1, the multiplier is the central input value at the current window position, which makes it easy to determine the derivative and thus the gradient. Another simplification of Equation 1 is to compare all positions in the index set only with the central value (see Figure 2). This simplifies the backpropagation of the error.
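The lookup described for Equation 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `fern_index` and `fern_evaluate`, the bit order in which decisions are packed into the index, and the example values are assumptions made here for illustration. The sketch covers only the selection of a leaf weight by binary comparisons.

```python
from typing import Sequence, Tuple


def fern_index(values: Sequence[float], pairs: Sequence[Tuple[int, int]]) -> int:
    """Pack the outcomes of all binary comparisons into one integer index."""
    index = 0
    for bit, (a, b) in enumerate(pairs):
        if values[a] > values[b]:  # one binary decision per pair
            index |= 1 << bit      # decision number `bit` sets bit `bit`
    return index


def fern_evaluate(values: Sequence[float],
                  pairs: Sequence[Tuple[int, int]],
                  distribution: Sequence[float]) -> float:
    """Return the leaf weight selected by the binary decisions."""
    # One entry per combination of decision outcomes (assumed layout).
    assert len(distribution) == 2 ** len(pairs)
    return distribution[fern_index(values, pairs)]


# Hypothetical example: four input values, three comparisons, eight leaves.
values = [0.2, 0.9, 0.5, 0.1]
pairs = [(0, 1), (2, 3), (1, 2)]
distribution = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]

idx = fern_index(values, pairs)
leaf = fern_evaluate(values, pairs, distribution)
```

With these example values the first comparison is false and the other two are true, so the index is 6 and the selected leaf weight is 0.35. Every decision is always evaluated, which is why the tree is balanced and a flat table can store all leaves.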
(2) 
This leads to Equation 2, which describes the evaluation of the decision tree for an input window. In the case of convolutions, the weights for this input window are not necessarily two-dimensional but can also form a tensor. This tensor is represented by several probability distributions. Each depth value of the input tensor has its own probability distribution, just as each depth uses its own two-dimensional weight matrix in a convolution (see Figure 3).
This means that in the case of decision trees, each input depth has its own decision tree in the sense of its own probability distribution. For Equation 2, this means that each depth of the input tensor indexes its own probability distribution using the same comparison indices.
(3) 
Equation 3 describes the calculation where it has to be taken into account that each depth performs a multiplication with the central probability and at the end, as with convolutions, the sum of all probabilities is computed. This summation makes it easier to determine the gradients for each distribution because there are no multiplicative dependencies between the distributions.
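The per-depth evaluation and summation described for Equations 2 and 3 can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the helper names `center_index` and `tree_response`, the use of a greater-than test against the central value, and all example numbers are hypothetical.

```python
def center_index(window, positions):
    """Compare the selected window positions with the central value."""
    k = len(window)
    center = window[k // 2][k // 2]
    index = 0
    for bit, (row, col) in enumerate(positions):
        if window[row][col] > center:
            index |= 1 << bit
    return index


def tree_response(windows, positions, distributions):
    """Sum over input depths of (selected probability * central value).

    windows[d]       : k x k window taken from input depth d
    distributions[d] : probability table owned by input depth d
    positions        : comparison positions shared by all depths
    """
    total = 0.0
    for window, table in zip(windows, distributions):
        k = len(window)
        center = window[k // 2][k // 2]
        total += table[center_index(window, positions)] * center
    return total


# Hypothetical example: two input depths, 3 x 3 windows, two comparisons.
positions = [(0, 0), (2, 2)]
windows = [
    [[0.9, 0.1, 0.1], [0.1, 0.5, 0.1], [0.1, 0.1, 0.2]],
    [[0.0, 0.1, 0.1], [0.1, 0.25, 0.1], [0.1, 0.1, 0.75]],
]
distributions = [
    [0.1, 0.2, 0.3, 0.4],
    [0.8, 0.6, 0.4, 0.2],
]
total = tree_response(windows, positions, distributions)
```

In this example the first depth selects entry 1 and the second selects entry 2, giving 0.2 * 0.5 + 0.4 * 0.25 = 0.2. Because the depths are combined by a sum, the contribution of each distribution can be differentiated independently of the others.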
The next step adds an output depth to the decision trees so that they can be used like convolution layers in neural networks (see Figure 4). As in the previous step, the same indexes are used for all output layers, but each output layer uses different probability distributions. Sharing the indexes is what reduces the complexity of the calculation compared to convolutions.
Complexity: Computing a convolution layer for an input tensor t with n convolution windows c requires a number of multiplications and additions proportional to n times the size of t times the size of c. The decision trees, on the other hand, determine the indices only once for all n outputs, which removes the output depth as a factor from the index computation. In addition, the multiplications inside the window are replaced by simple greater-than or less-than comparisons followed by a single multiplication. It follows that the comparisons are performed once per window position, whereas the multiplications and additions are still performed for every output depth.
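The counting argument above can be made concrete with a small Python sketch. The cost model is an assumption made here for illustration (stride 1, one window per spatial position, one multiply-add per selected probability), and the function names and layer sizes are hypothetical. It is not the paper's exact formula.

```python
def conv_mult_adds(height, width, depth_in, kernel, depth_out):
    """Multiply-adds of a plain convolution layer (stride 1, same size)."""
    return height * width * depth_in * kernel * kernel * depth_out


def tree_ops(height, width, depth_in, comparisons, depth_out):
    """Operation counts for the tree layer under the assumed cost model."""
    # Comparisons are evaluated once and shared by all output depths.
    n_comparisons = height * width * depth_in * comparisons
    # One multiply-add per input depth and output depth at each position.
    n_mult_adds = height * width * depth_in * depth_out
    return n_comparisons, n_mult_adds


# Hypothetical layer: 32 x 32 input, 16 input depths, 32 output depths.
conv_cost = conv_mult_adds(32, 32, 16, 3, 32)
cmp_cost, mac_cost = tree_ops(32, 32, 16, 8, 32)
```

For this hypothetical layer the convolution needs 4,718,592 multiply-adds, while the tree layer needs 131,072 comparisons and 524,288 multiply-adds. The comparison count does not depend on the output depth, which is the source of the reduction described in the text.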
(4) 
To extend Equation 3 in this respect, each individual output layer must be assigned its own set of probability distributions. Equation 4 describes this change; it is important that every tree uses the same indexes.
A disadvantage of the approach presented so far is that the size of the distributions grows exponentially with the number of binary comparisons. The memory requirements can therefore very quickly reach the limits of modern computers, and the numerical calculation of very small numbers in large distributions can become too inaccurate. Another disadvantage of large distributions, i.e. a large number of binary comparisons, is that the probability that a given index is used during training decreases as the distribution grows. Covering a full convolution window with comparisons against the central value requires one comparison per non-central position, and the distribution size doubles with each of them. To use several small distributions and still cover larger input windows, we adopt the idea of the inception architecture szegedy2015going: different index sets of a given depth, each associated with its own distributions, are aggregated in one output tensor. In our implementation we used summation per layer.
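The exponential growth can be checked with a few lines of Python. The sketch assumes one table entry per combination of comparison outcomes and 4-byte floating point entries. Both the storage format and the function names are assumptions made for illustration.

```python
def full_window_comparisons(kernel):
    """Comparisons needed to test every non-central position of a window."""
    return kernel * kernel - 1


def distribution_entries(comparisons):
    """One table entry per combination of binary outcomes."""
    return 2 ** comparisons


def distribution_bytes(comparisons, bytes_per_entry=4):
    """Memory of one table, assuming 4-byte entries (an assumption)."""
    return distribution_entries(comparisons) * bytes_per_entry


entries_3x3 = distribution_entries(full_window_comparisons(3))
entries_5x5 = distribution_entries(full_window_comparisons(5))
bytes_5x5 = distribution_bytes(full_window_comparisons(5))

# Several small tables instead of one large one: six sets of four comparisons.
entries_small_sets = 6 * distribution_entries(4)
```

A 3 x 3 window with all comparisons against the centre needs 256 entries, a 5 x 5 window already needs 16,777,216 entries (64 MiB per table at 4 bytes each), while six index sets of four comparisons need only 96 entries in total. This is the motivation for aggregating several small index sets.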
(5) 
Equation 5 describes the complete forward propagation per output layer of the presented method for training decision trees in neural networks. Every binary decision set is used to compute the index into its assigned distributions. For each input window, the selected probabilities are multiplied by their corresponding input probabilities, summed, and written into the output tensor. The bias term is omitted from the formulas for simplicity but is used as in conventional convolution layers.
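The forward propagation described above can be sketched for a single window position as follows. This is an illustrative reading of the prose, not the authors' code: the nesting `distributions[o][s][d]`, the helper names, the omission of the bias, and the example values are all assumptions.

```python
def center_index(window, positions):
    """Binary decisions against the central value, packed into an index."""
    k = len(window)
    center = window[k // 2][k // 2]
    index = 0
    for bit, (row, col) in enumerate(positions):
        if window[row][col] > center:
            index |= 1 << bit
    return index


def forward_position(windows, index_sets, distributions):
    """Outputs of all output depths for one window position (bias omitted).

    windows[d]             : k x k window of input depth d
    index_sets[s]          : positions compared with the centre in set s
    distributions[o][s][d] : table with 2 ** len(index_sets[s]) entries
    """
    k = len(windows[0])
    centers = [w[k // 2][k // 2] for w in windows]
    # The indices are computed once and reused by every output depth.
    indices = [[center_index(w, s) for w in windows] for s in index_sets]
    outputs = []
    for per_output in distributions:
        total = 0.0
        for s, per_set in enumerate(per_output):
            for d, table in enumerate(per_set):
                total += table[indices[s][d]] * centers[d]
        outputs.append(total)
    return outputs


# Hypothetical example: one input depth, two index sets, two output depths.
windows = [[[0.9, 0.1, 0.3], [0.7, 0.5, 0.2], [0.4, 0.6, 0.8]]]
index_sets = [[(0, 0), (0, 1)], [(2, 2)]]
distributions = [
    [[[0.1, 0.2, 0.3, 0.4]], [[0.5, 0.6]]],
    [[[0.4, 0.3, 0.2, 0.1]], [[0.2, 0.1]]],
]
outputs = forward_position(windows, index_sets, distributions)
```

Both index sets select entry 1 here, so the two output depths evaluate to 0.2 * 0.5 + 0.6 * 0.5 = 0.4 and 0.3 * 0.5 + 0.1 * 0.5 = 0.2. Only the table lookups and multiply-adds are repeated per output depth.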
The backward propagation of the error occurs inversely to the forward propagation. This means that as with convolution layers, a convolution with the error tensor takes place for each input value of the input tensor.
(6) 
Equation 6 describes the backpropagation over the depth of the output tensor. Each value of the input layer is assigned the sum of the errors multiplied by the indexed probabilities. In addition, for each value that participated in the binary decisions, the error divided by the size of the binary decision set used is added (Equation 7).
(7) 
Equation 7 is calculated for each index in each binary decision set used and sums the error over the depth of the output tensor. Dividing by the set size assigns an equal share of the error to each index. This is because the contribution to the resulting error is independent of the binary value returned by the decision function.
(8) 
To determine the gradient, only the derivative of the generated error with respect to the input needs to be considered. This is described in Equation 8, which shows that only the central value of the input window and the output value are required. For the binary decision functions the derivative is 0, since they are independent of the probability value in the distribution.
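The backward step described for Equations 6 to 8 can be sketched for one window position and one index set. The sketch is an interpretation of the prose under stated assumptions: the forward value is taken to be the sum over input depths of the selected probability times the central value, and the function name, argument layout, and example numbers are hypothetical.

```python
def backward_position(centers, indices, tables, errors, positions):
    """Gradients for one window position and one shared index set.

    centers[d]   : central input value of input depth d
    indices[d]   : table index selected by the binary decisions of depth d
    tables[o][d] : distribution of output depth o and input depth d
    errors[o]    : error arriving at output depth o
    positions    : window positions that took part in the comparisons
    """
    table_grads = {}                       # (o, d, index) -> gradient
    center_grads = [0.0] * len(centers)
    share_grads = [dict.fromkeys(positions, 0.0) for _ in centers]
    for o, error in enumerate(errors):
        for d, center in enumerate(centers):
            i = indices[d]
            # Only the selected entry receives a gradient: error * centre.
            table_grads[(o, d, i)] = error * center
            # The central value receives error * selected probability.
            center_grads[d] += error * tables[o][d][i]
            # Compared positions receive an equal share of the error.
            for p in positions:
                share_grads[d][p] += error / len(positions)
    return table_grads, center_grads, share_grads


# Hypothetical example: one input depth, two output depths, two comparisons.
centers = [0.5]
indices = [1]
tables = [[[0.1, 0.2, 0.3, 0.4]], [[0.4, 0.3, 0.2, 0.1]]]
errors = [1.0, -2.0]
positions = [(0, 0), (0, 1)]
table_grads, center_grads, share_grads = backward_position(
    centers, indices, tables, errors, positions)
```

The binary decisions themselves contribute no gradient, which matches the statement that their derivative is 0. The equal share assigned to compared positions follows the description of Equation 7 and is not a derivative of the forward value.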
3 Experiments
Figure 6 shows all architectures used in this evaluation. In each architecture, conventional convolutions and the proposed trees were used separately. The first net corresponds to the LeNet5 architecture lecun1998gradient, but we used the sigmoid function instead of the hyperbolic tangent. The top right of Figure 6 shows the indexes used for the trees. In the case of a convolution, the two indices under were used. For convolutions the four index sets under were added additionally to the two sets from . We used these fixed indexes to facilitate reproducibility, since we train each model with a random initialization. To further simplify the reproducibility of the results, no data augmentation was used, and for each experiment the input is a grayscale image normalized to the range . For the MNIST images we upscaled them to pixels using OpenCV opencv_library version 3.1 with the linear interpolation method.
In addition, stochastic gradient descent (SGD) was used as the optimization method with the L2 loss function in every experiment without the usage of momentum. The batch size was set to 1 for the MNIST lecun1998gradient data set (Table 1) and to 10 for CIFAR10 krizhevsky2009learning (Table 2) and CIFAR100 krizhevsky2009learning (Table 3). In the case of MNIST, the net was always evaluated on all test data after training on 5% of the training data. For CIFAR10 and CIFAR100, this evaluation was done after training on 50% of the training data.
To select the training data before each evaluation, all training data were evaluated and sorted by loss. In the case of MNIST, each batch was randomly selected from the worst 10% of the training data. For CIFAR10 and CIFAR100, the worst 50% of the training data were used to randomly generate each batch. Exactly the same settings were used for our decision trees and for the convolutions. The training and test split follows the official MNIST, CIFAR10 and CIFAR100 data set splits. No parts of the evaluation or training data were excluded. We always report the best accuracy, as is done for the state-of-the-art approaches. As state-of-the-art representatives we selected the top 3 methods from this website.
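The batch selection described above can be sketched in Python. This is a minimal illustration, assuming sampling without replacement inside a batch and a pool that is never smaller than the batch. The function name `sample_hard_batch` and the toy losses are hypothetical.

```python
import random


def sample_hard_batch(losses, worst_fraction, batch_size, rng):
    """Draw a batch at random from the samples with the highest loss."""
    # Sort sample indices by loss, worst first.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    # Keep the worst fraction, but at least enough samples for one batch.
    pool_size = max(batch_size, int(len(losses) * worst_fraction))
    pool = order[:pool_size]
    return rng.sample(pool, batch_size)


# Toy losses: sample i has loss i, so the highest indices are the hardest.
losses = [float(i) for i in range(20)]
rng = random.Random(0)

# CIFAR-style setting from the text: worst 50%, batch size 10.
cifar_batch = sample_hard_batch(losses, 0.5, 10, rng)

# MNIST-style setting from the text: worst 10%, batch size 1.
mnist_batch = sample_hard_batch(losses, 0.1, 1, rng)
```

With 20 toy samples, the CIFAR-style call draws from the 10 samples with the highest loss and the MNIST-style call draws one of the two worst samples. In the described procedure the losses would be recomputed over all training data before each evaluation.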
The hardware on which we conducted the training and the evaluation has an Intel i5-4570 CPU at 3.2 GHz, 16 GB DDR4 RAM, and a Windows 7 Professional 64-bit operating system (Service Pack 1).
Table 1: Accuracy on MNIST.
Method        LeNet5 (Tree)  LeNet5 (Conv.)  wan2013regularization  cirecsan2012multi  sato2015apac
Accuracy (%)  99.73          99.72           99.79                  99.77              99.77
Table 1 shows the results of our adapted LeNet5 model compared to the state of the art. With convolutions (99.72%, achieved after 114 evaluations or 5.7 epochs), our version even surpasses the original LeNet5 lecun1998gradient, which achieved an accuracy of 99.05%. The result is even comparable with today's state of the art, although those approaches all applied data augmentation. Our decision tree approach achieves a similar and slightly better accuracy (99.73%, achieved after 195 evaluations or 9.75 epochs). It exceeds the original LeNet5 lecun1998gradient and improves by 0.01 percentage points over the use of convolutions. If the runtime is also considered (Table 4), the decision trees require only about one third of the computing time of the convolutions (evaluated on a single CPU core).
Table 2: Accuracy on CIFAR10.
Method        ACI10 (Tree)  ACI10 (Conv.)  DBLP:journals/corr/Graham14a  springenberg2014striving  mishkin2015all
Accuracy (%)  98.41         94.53          96.53                         95.59                     94.16
Table 2 compares our ACI10 model, with convolutions and with decision trees, to the state of the art on the CIFAR10 data set. Our model with convolutions already achieves a comparable classification result (94.53%, achieved after 153 evaluations or 76.5 epochs) with a runtime of 2 ms (Table 4) on one CPU core. It is 2 percentage points behind the approach of DBLP:journals/corr/Graham14a, but it has to be noted that we did not use any data augmentation. The decision tree variant exceeds that result by 1.88 percentage points (98.41%, achieved after 464 evaluations or 232 epochs) with a runtime of 0.7 ms (Table 4). This represents a significant improvement in both classification accuracy and computation time.
Table 3: Accuracy on CIFAR100.
Method        ACI100 (Tree)  ACI100 (Conv.)  clevert2015fast  graham2014spatially  DBLP:journals/corr/Graham14a
Accuracy (%)  83.16          78.28           75.72            75.7                 73.61
Table 3 compares our ACI100 model, with convolutions and with decision trees, to the state of the art on the CIFAR100 data set. The ACI100 model differs from ACI10 only in a larger number of neurons in the penultimate fully connected layer (134 instead of 64) and in the number of output neurons, which had to be increased from 10 to 100 to match the number of classes. With convolutions, our model already exceeds the state of the art by 2.56 percentage points (78.28%, achieved after 148 evaluations or 74 epochs). With decision trees instead of convolutions, the margin over the state of the art grows to 7.44 percentage points (83.16%, achieved after 496 evaluations or 248 epochs), which is 4.88 points above our convolutional model. The runtime of the decision trees is about half that of the convolutions. This smaller relative reduction is due to the enlargement of the fully connected layers at the end.
Table 4: Runtime in ms on a single CPU core.
Architecture     LeNet5  ACI10  ACI100  WING
C = Convolution  1.4     2.0    2.2     220.3
C = Tree         0.5     0.7    0.9     80.8
Table 4 gives an overview of the runtimes of all models with convolutions and with decision trees. A larger model (WING feng2018wing), which is used in landmark detection, was also considered. All runtime evaluations were performed on a single CPU core to ensure reproducibility and to simplify comparison with other hardware environments. For the largest model (WING, see Figure 6), the runtime reduction achieved by our approach becomes particularly clear: the runtime drops to about one third, and the most computationally intensive part that remains is the fully connected layers at the end. The reduction to about one third also holds for LeNet5 and ACI10. For ACI100 the runtime was reduced only by about half, which is due solely to the larger final fully connected layers. Figure 6 shows that the models ACI10 and ACI100 differ only in the last fully connected layers.
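The ratios quoted for Table 4 can be recomputed directly from the reported runtimes. The short Python sketch below only restates the table values; the variable names are chosen here for illustration.

```python
# Runtimes from Table 4 as (convolution, tree) pairs.
runtimes = {
    "LeNet5": (1.4, 0.5),
    "ACI10": (2.0, 0.7),
    "ACI100": (2.2, 0.9),
    "WING": (220.3, 80.8),
}

# Fraction of the convolution runtime that the tree variant needs.
ratios = {name: tree / conv for name, (conv, tree) in runtimes.items()}
rounded = {name: round(ratio, 2) for name, ratio in ratios.items()}
```

The tree variants need roughly 0.36, 0.35, 0.41 and 0.37 of the convolution runtime for LeNet5, ACI10, ACI100 and WING respectively. This is consistent with a reduction to about one third for three of the models and a somewhat smaller reduction for ACI100.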
Table 5: Accuracy (%) of the ACI100 model on CIFAR100 with randomly selected index sets.
                Distribution / index set size
Amount of sets  2      4      6
2               88.22  89.9   89.71
4               86.55  89.8   83.1
6               88.23  88.64  83.7
Table 5 shows the evaluation of the ACI100 model on CIFAR100 with different numbers of randomly selected index sets and set sizes in a window size of . The same training parameters were used as for the evaluation on the CIFAR100 data set (Table 3). Compared to Table 3, where the fixed predefined index sets from Figure 6 were used, the results show that the presented approach still has significant potential for improvement. All randomly selected index sets outperform the state of the art in Table 3.
4 Conclusions and Discussions
We presented a novel approach for training decision trees in neural network architectures using the backpropagation algorithm and showed its advantages on several public data sets in comparison with the state of the art.
From an industrial point of view, reducing the runtime while maintaining or even improving the predictive quality is a desirable improvement. However, the training of decision trees in neural networks needs further research, as we think it is a very promising direction for the efficient, large-scale application of deep learning. In our evaluations we limited ourselves to classification, although regression, as in landmark detection, is also an important application of neural networks. Furthermore, we did not evaluate residual layers, in which reducing the computational complexity by the depth of the output layer should significantly reduce the runtime. Further interesting possibilities are the use of index sets with different depths and the pruning of the decision trees to only the necessary paths. Here we see further opportunities for reducing the computation time.