Training Decision Trees as Replacement for Convolution Layers
We present an alternative to the convolution layers used in convolutional neural networks (CNNs). Our approach reduces the complexity of convolutions by replacing them with binary decisions. These binary decisions serve as indices into conditional distributions, where each weight represents a leaf in a decision tree. Since the indices to the weights only have to be determined once, the complexity of a convolution is reduced by the depth of the output tensor. Index computation is performed by simple binary decisions that require fewer cycles than the conventionally used multiplications. We show how these binary decisions form the indices into the conditional distributions and how they replace 2D weight matrices as well as 3D weight tensors. The new layers can be trained like convolution layers in CNNs with the backpropagation algorithm, for which we provide a formalization.
Our results on multiple publicly available data sets show that our approach performs similarly to conventional neural networks. Beyond the formalized reduction of complexity and the improved qualitative performance, we empirically show the runtime improvement compared to convolution layers.
Introduction and Related Work
Conditioning CNNs is a modern approach to reducing runtime, typically achieved by activating only parts of the model or by scaling model complexity [33, 16, 7] to reduce computational costs without compromising accuracy. Recent approaches even reduce the complexity of the convolution layers themselves without affecting accuracy. This paper describes a new approach to the practical implementation of conditional neural networks using conditional distributions and binary decisions. Like prior replacement layers, we replace convolution layers to reduce computational complexity, with the addition of indexing by simple binary decisions. We show, analytically and empirically, the reduction of the computational runtime on public data sets as well as the retention or improvement of model accuracy.
There are four main categories of Conditional Neural Networks:
Neural Networks that use loss functions for optimizing decision parameters.
Probabilistic approaches that learn a selection of experts.
Neural networks with decision tree architectures.
Replacement layers for convolutions, which conditionally map hierarchical decision graphs to the input feature space.
The first category uses non-differentiable decision functions whose parameters are learned through an additional loss function. One approach presented a loss function that maximizes the distances between subclusters. Another uses a path loss function based on the purity of the data activation with respect to its class label distribution. A third uses the information gain to learn an evaluation function that activates paths through the network.
In the second category, probabilistic approaches are pursued. One approach assigns weights to each branch and treats them as a sum over a loss function. A similar approach considers a very large number of branches per layer and follows only the best k branches, in both the training and the test phase. Another approach jointly trains two neural networks, where one provides the decision probability at its output and the second performs the classification.
In the third category, the architecture of the neural network itself resembles a decision tree. Randomized multi-layer perceptrons have been used as branching nodes and trained together with the entire network. In an alternative architecture, each node in the network has three possible successor nodes, and the selection of the following node is done via an evaluation function learned with the REINFORCE algorithm. Another approach learns partitioning features that make it possible to train the whole network with the backpropagation algorithm; the architecture of the network corresponds to that of a binary decision tree, where each node represents a split and therefore has exactly two outputs, of which only one can be active at a time.
The fourth and last category includes approaches that introduce new layers into a neural network. Spatial transformer networks learn a transformation of the input tensor that simplifies further processing in the network; in general this is a uniform representation of the input tensor that can be understood as spatial alignment. Since the accuracy of a network depends not only on the input but also on the filters in the convolution layers, another approach introduces a layer that learns to generate optimal filters based on the input. This layer consists of a small neural network with convolution and transposed convolution layers. A further possibility for the conditional adaptation of neural networks is the configuration of the weights over a temporal course via a phase function; a Catmull-Rom spline has been used as the phase function, which can also be replaced by a neural network. Additive component analysis, in contrast, tries to realize a non-linear dimension reduction by an approximation of additive functions; it is also defined as a fully connected layer and can be stacked and trained over several layers. An approach based on this is SplineNets, which assign a new interpolated value on a learned spline via the response of a learned filter in the previous layer. This spline makes the function differentiable, and several of these layers in sequence can be understood as a topological graph.
Our novel approach is based on the idea of SplineNets to reduce convolutional complexity by simply mapping input characteristics to interpolated values. In addition, we simplify index generation with the general idea of binary neural networks. For this we use conditional distributions like random ferns. The indices are determined by simple greater-than/less-than comparisons between input values. These indices are used to select weights from several distributions, which are then multiplied by the input values. The indices themselves are the evaluation of the decision tree, and the selected weight is the leaf node. This means that we consider both the values in the distributions and the input and output values as probabilities. This allows us to train the whole new layer with the backpropagation algorithm together with the whole network, as well as to connect several such layers in series. As with SplineNets, the reduction of the computational complexity comes from the indexing, which has to be calculated only once, unlike convolution layers, where a new convolution has to be calculated for each filter. In addition, our layer does not have to learn function parameters or perform expensive multiplications to generate the indices.
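As a concrete illustration, index generation by binary comparisons can be sketched as follows. This is a minimal reconstruction of the general idea, not the authors' C++/CUDA implementation; the function name and the comparison positions are illustrative assumptions.

```python
# Hypothetical sketch of fern-style index generation: each bit of the index
# comes from comparing one neighboring value with the central value of the
# input window. No multiplications are needed to build the index.
import numpy as np

def fern_index(window, positions):
    """Build an integer index from binary comparisons.

    window    -- 2D input patch with odd side lengths
    positions -- list of (row, col) offsets compared against the center
    """
    c = window[window.shape[0] // 2, window.shape[1] // 2]  # central value
    index = 0
    for bit, (r, q) in enumerate(positions):
        # one binary decision per position: is the neighbor larger than the center?
        if window[r, q] > c:
            index |= 1 << bit
    return index

# A 3x3 patch and three comparison positions give an index in [0, 8).
patch = np.array([[0.1, 0.9, 0.2],
                  [0.4, 0.5, 0.7],
                  [0.3, 0.8, 0.6]])
idx = fern_index(patch, [(0, 1), (1, 2), (2, 1)])
```

The index then selects one leaf weight out of a distribution of size 2 to the power of the number of comparisons.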
Due to the conditional weights, which are trained holistically in one layer, our approach belongs to category 4. Since index generation is based on comparisons, and random ferns are a specialization of random forests, our approach also belongs to category 3. It is therefore a hybrid approach that is formalized as an independent layer but contains decision tree structures.
Our contributions in this work are:
A new layer that selects leaf weights based on binary decisions.
The approximation of filters for index generation by binary decisions.
A differentiable formal definition of the forward execution which is suitable for the backpropagation algorithm.
Analytical and empirical evaluation of the quality and runtime improvement compared to CNNs.
Figure 1 shows the core concept of our method. Random ferns are binary decisions that are linked to conditional weights (see Figure 1). The binary decisions themselves represent the conditions, so together they form a decision tree. Since each binary decision is always evaluated, the structure of this tree is arbitrary under the condition that each decision function must be contained exactly once in each path, which makes the decision tree a balanced tree.
Equation 1 describes the evaluation of such a decision tree, or fern, in terms of the input tensor, the distribution (see Figure 1), and the indices of the comparisons. To use this decision tree like a convolution, the indices refer only to values in an input window that is moved over the whole input tensor (see Figure 2). To combine several of these decision trees, the weights are multiplied. In the case of Equation 1, these would be the centered input values at the current window position, making it easy to determine the derivative and thus the gradient. A further simplification of Equation 1 is to compare all positions using only the central value (see Figure 2), which simplifies the back propagation of the error.
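The sliding-window evaluation described above can be sketched in the following hedged form. The window size, comparison positions, and names are our assumptions for illustration, not the paper's implementation; the essential point is that the selected leaf weight is multiplied by the central input value.

```python
# Sketch of the fern evaluation over a 2D input: slide a 3x3 window over the
# input, compute a fern index from comparisons with the central value, and
# write weight[index] * center into the output map.
import numpy as np

def fern_layer_2d(x, weights, positions):
    """x: 2D input; weights: array of 2**len(positions) leaf weights."""
    k = 1  # window radius for a 3x3 window
    h, w = x.shape
    out = np.zeros((h - 2 * k, w - 2 * k))
    for i in range(k, h - k):
        for j in range(k, w - k):
            window = x[i - k:i + k + 1, j - k:j + k + 1]
            c = window[k, k]  # central value of the current window
            index = 0
            for bit, (r, q) in enumerate(positions):
                # binary decision against the central value
                if window[r, q] > c:
                    index |= 1 << bit
            # leaf weight times central value: differentiable w.r.t. the weights
            out[i - k, j - k] = weights[index] * c
    return out
```

Because the output is linear in the selected weight, the gradient with respect to that weight is simply the central input value.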
This leads to Equation 2, which describes the evaluation of the decision tree for an input window. In the case of convolutions, this input window is not necessarily two-dimensional but can be a tensor, and the corresponding weight tensor is represented by several distributions. Each depth value of the input tensor has its own distribution, just as with convolutions, where each depth uses its own two-dimensional weight matrix (see Figure 3).
This means that, in the case of decision trees, each input depth has its own decision tree in the sense of its own distribution. For Equation 2 this means that each depth of the input tensor indexes its own distribution using the same indices.
Equation 3 describes this calculation, taking into account that each depth performs a multiplication with its central value and that, at the end, as with convolutions, the sum of all multiplications is computed. This summation makes it easier to determine the gradients for each distribution because there are no multiplicative dependencies between the distributions.
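The per-depth summation of Equation 3 can be sketched as follows, again in hedged notation of our own: each depth has its own distribution, all depths share the same comparison positions, and the per-depth products with the central values are summed as in a convolution.

```python
# Sketch of the depth-wise fern response: one distribution per input depth,
# shared comparison positions, and a summation over depths at the end.
import numpy as np

def fern_response_3d(window3d, distributions, positions):
    """window3d: (depth, h, w) patch; distributions: (depth, 2**bits) weights."""
    depth, h, w = window3d.shape
    total = 0.0
    for d in range(depth):
        plane = window3d[d]
        c = plane[h // 2, w // 2]  # central value of this depth slice
        index = 0
        for bit, (r, q) in enumerate(positions):
            if plane[r, q] > c:
                index |= 1 << bit
        # one product per depth; summing keeps the gradients per
        # distribution free of multiplicative dependencies
        total += distributions[d, index] * c
    return total
```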
The next step introduces the output depth of the decision trees so that they can be used like convolution layers in neural networks (see Figure 4). As in the previous step, the same indices are used for all output layers, but different distributions are used for each layer. This is what reduces the complexity of the calculation compared to convolutions.
Complexity: The calculation of a convolution layer with input tensor t and n convolution windows c requires one multiplication and one addition per window element, per filter, and per output position. The decision trees, on the other hand, only have to determine the indices once, independently of the number of output channels, thus reducing the complexity by the output depth. In addition, the multiplications for index generation are replaced by simple greater-than or less-than comparisons, followed by a single multiplication per selected weight.
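Since the original symbols did not survive extraction, the comparison can be written in hedged notation of our own: with input size $|t|$, window size $|c|$, $n$ output channels, and $|B|$ binary decisions per position, a convolution layer costs on the order of $|t|\,|c|\,n$ multiplications and additions, while the fern layer computes the comparisons once and then performs one multiplication and addition per channel:

```latex
\underbrace{|t| \cdot |c| \cdot n}_{\text{convolution: mult.\ and add.}}
\quad \text{vs.} \quad
\underbrace{|t| \cdot |B|}_{\text{comparisons, computed once}}
\; + \;
\underbrace{|t| \cdot n}_{\text{mult.\ and add.}}
```

The saving grows with the number of output channels $n$, because the comparison term is independent of $n$.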
To extend Equation 3 in this respect, each individual output layer must be assigned its own set of distributions. Equation 4 describes this change; it is important to ensure that every tree uses the same indices.
A disadvantage of the approach presented so far is that the size of the distributions grows exponentially with the number of binary decisions. This means that the memory requirements can very quickly reach the limits of modern computers, and the numerical calculation of very small numbers in large distributions can become too inaccurate. Another disadvantage of large distributions, i.e. a large number of binary comparisons, is that the probability that an index will be used during training decreases the larger the distribution is. A distribution containing all comparisons with the central value of a typical convolution window would already be prohibitively large. In order to use several small distributions and still cover larger input windows, we use the idea of the inception architecture. This means that different index sets, each associated with its own distributions, are aggregated into one output tensor. In our implementation we used summation per layer.
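The inception-style aggregation can be sketched in the following hedged form (names and shapes are our assumptions): several small ferns, each with its own comparison set and distribution, are evaluated on the same window against the same central value, and their responses are summed per output position.

```python
# Sketch of the inception-style aggregation: summing several small ferns
# keeps each distribution small (few bits each) while their comparison
# positions together can cover a larger input window.
import numpy as np

def aggregated_response(window, ferns):
    """window: 2D patch; ferns: list of (positions, weights) pairs."""
    h, w = window.shape
    c = window[h // 2, w // 2]  # shared central value
    total = 0.0
    for positions, weights in ferns:
        index = 0
        for bit, (r, q) in enumerate(positions):
            if window[r, q] > c:
                index |= 1 << bit
        total += weights[index] * c  # summation per layer, as described above
    return total
```

Two one-bit ferns, for example, need only 2 + 2 weights, whereas a single two-bit fern over the same positions would need 4.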
Equation 5 describes the complete forward propagation per output layer of the presented method for training decision trees in neural networks. All binary decision sets are used to compute the indices for their assigned distributions. The sum of all selected weights, each multiplied with its corresponding input value, is calculated for each input window and written into the output tensor. The bias term is omitted from the formulas for simplicity but is used as in conventional convolution layers.
The backward propagation of the error occurs inversely to the forward propagation. This means that, as with convolution layers, a convolution with the error tensor takes place for each input value of the input tensor.
Equation 6 describes the back propagation over the depth of the output tensor. Each value of the input layer is assigned the sum of the errors multiplied by the indexed weights. In addition, for each value that participated in the binary decisions, the error divided by the size of the used binary decision set is added (Equation 7).
Equation 7 is calculated for each index in each used binary decision set and sums the error over the output tensor of the corresponding depth. The division by the set size assigns an equal share of the error to each index. This is because the contribution to the resulting error is independent of the binary value of the decision function's evaluation.
To determine the gradient, only the derivative of the generated error with respect to the weights needs to be considered. This is described in Equation 8 and shows that only the central value of the input window and the output value are required. For the binary decision functions, the derivative is 0, since they are independent of the weights in the distribution.
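A sketch of the resulting weight gradient, in our notation derived from the description above rather than copied from the paper: because the forward response is the selected leaf weight times the central input value, the gradient with respect to that leaf is the output error times the central value, while all other leaves and the binary decisions themselves receive zero gradient.

```python
# Sketch of the per-window weight gradient: only the leaf selected by the
# fern index receives a gradient, equal to error * central input value.
import numpy as np

def fern_weight_gradient(center, index, out_error, n_weights):
    """center: central input value; index: selected leaf; out_error: dL/dout."""
    grad = np.zeros(n_weights)
    grad[index] = out_error * center  # only the selected leaf is updated
    return grad
```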
Figure 6 shows the index patterns used in our evaluations. We used the LeNet-5 model with rectified linear units (ReLU) instead of the hyperbolic tangent function, and the deep residual models with depth 16 (ResNet-16) and 34 (ResNet-34). In both residual models we used a batch normalization block after each convolution or decision tree layer. The LeNet-5 model was used for the comparison on the MNIST dataset with the index patterns TI2 and TI3 (Figure 6). The ResNet-34 was used for the comparison on the CIFAR10 dataset with the TI1 pattern. As an alternative evaluation to image classification we used landmark regression; for this, we compared the decision trees with convolutions on the 300W dataset using the ResNet-16 and the TI1 pattern.
The general idea behind our experiments is not to surpass the state-of-the-art, but to compare decision trees with convolutions in the same architecture. For this purpose we tried to get as close as possible to the results of the state-of-the-art with simple means and to design the training process for convolutions and decision trees in the same way.
MNIST consists of 70,000 handwritten digits and therefore has ten classes. Each image is provided as a gray-scale image. The training set contains 60,000 and the test set 10,000 images. This data set is size-normalized and centered and represents a subset of the larger NIST dataset. As evaluation metric the classification accuracy is used. We only report the best result, as is done for the state-of-the-art [36, 8, 32].
CIFAR10 consists of 60,000 color images in ten categories. Each image is provided in RGB format. The training set contains 50,000 and the test set 10,000 images. As evaluation metric the classification accuracy is used. We only report the best result, as is done for the state-of-the-art [14, 34, 27].
300W is an aggregation of multiple datasets (LFPW, HELEN, AFW, and XM2VTS). The training set consists of 3,148 face images from the LFPW and HELEN datasets. For the test set, 689 images are provided. Each image has 68 annotated landmarks. In the evaluation, the test set is separated into three categories: the full set, the challenging set (iBUG, 135 images), and the common set (LFPW and HELEN, 554 images). As evaluation metric we used the normalized mean error (NME), which corresponds to the average distance between detected and annotated landmarks, normalized by the pixel distance between the two eye centers. This is the same evaluation procedure as used for the state-of-the-art [13, 12, 29].
Training parameters for MNIST:
We used the Adam optimizer with the first momentum set to 0.9 and the second momentum set to 0.999. Weight decay was set separately for the convolutions and for the decision trees. The batch size was set to 400, and each batch was always balanced in terms of the available classes; this means that in each batch each class was represented 40 times. The initial learning rate was reduced after each 100 epochs until it reached its final value, at which we continued training for an additional 1000 epochs and selected the best result. For data augmentation we used random noise in the range of 0-30% of the image resolution.
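The class-balanced batching can be sketched as follows. This is a hypothetical helper: the name `balanced_batch` and the sampling details are our assumptions; only the per-class count per batch comes from the text.

```python
# Sketch of class-balanced batch construction: draw the same number of
# examples from every class so each batch is balanced (40 per class for
# MNIST's batch size of 400).
import random

def balanced_batch(indices_by_class, per_class=40, seed=0):
    """indices_by_class: dict mapping class label -> list of sample indices."""
    rng = random.Random(seed)
    batch = []
    for cls, idxs in indices_by_class.items():
        # sample without replacement within each class
        batch.extend(rng.sample(idxs, per_class))
    rng.shuffle(batch)  # avoid class-ordered batches
    return batch
```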
Training parameters for CIFAR10:
We used the Adam optimizer with the first momentum set to 0.9 and the second momentum set to 0.999. Weight decay was set separately for the convolutions and for the decision trees. The batch size was set to 50 with the same batch balancing approach as for the MNIST dataset; for CIFAR10 this means each batch consisted of five examples per class. The initial learning rate was reduced after each 500 epochs until it reached its final value, at which we continued training for an additional 1000 epochs and selected the best result. For data augmentation we used random cropping of patches, random color offsets, random color distortion, flipping the image horizontally and vertically, as well as random noise in the range of 0-20% of the image resolution. Additionally, we overlaid patches of the same class with an intensity of up to 20%.
Training parameters for 300W:
All images were resized to a fixed resolution. We used the Adam optimizer with the first momentum set to 0.9 and the second momentum set to 0.999. Weight decay was set separately for the convolutions and for the decision trees. The batch size was set to 30. Every 20 iterations, an evaluation of the landmark accuracy was performed for the test and the training set. The accuracy on the training set was used to balance the batches. This was done by splitting the training set into three categories: the first category contains the most inaccurate 20%; the second category covers the range between the first category and the most inaccurate 50%; the last category covers the range between the second category and the most inaccurate 80%. For each batch we selected ten examples out of each category. The initial learning rate was increased after each 100 epochs until it reached its maximum, at which we continued training for an additional 1000 epochs. Afterwards, we reduced the learning rate after each 100 epochs until we reached the final value and stopped the training. For data augmentation we used random noise in the range of 0-20% of the image resolution. The image and landmarks were randomly shifted by up to 20% of the image resolution in each direction. Additionally, we randomly added Gaussian blur. For occlusions we overlaid up to three boxes and filled them either with one fixed random value or with a random value for each pixel in the box. We also randomly changed the contrast of the image in the range [-40, 40].
Hardware and implementation:
For training and evaluation we used two different hardware setups. For LeNet-5 we used a desktop PC with an Intel i5-4570 CPU (3.2 GHz), 16 GB DDR4 RAM, an NVIDIA GTX 1050Ti GPU with 4 GB RAM, and a Windows 7 64-bit operating system. The second hardware setup was used for the ResNet models, since those require more GPU RAM. Here we used a server with an Intel i9-9900K CPU (3.6 GHz), 64 GB DDR4 RAM, two RTX 2080Ti GPUs with 11.2 GB RAM each, and a Windows 8.1 64-bit operating system. We implemented the decision tree layer in C++ on the CPU and in CUDA on the GPU. The implementation was integrated into the DLIB framework, which uses CUDNN functions. An implementation for TensorFlow and Torch is also planned, since those are currently the most popular frameworks.
Table 1 shows the results of our adapted LeNet-5 model. As can be seen, the TI2 and TI3 patterns (Figure 6) perform similarly to the convolutions. The TI2 pattern is an approximation of a convolution and achieves a classification accuracy of 99.23%. In comparison, the convolutions as used in the LeNet-5 model achieve an accuracy of 99.37%, an improvement of 0.14%. Approximating the convolutions with the TI3 pattern and the inception technique achieves 99.48%. If the runtime is also considered (Table 4), it can be seen that the decision trees require only one third of the computing time in comparison to the convolutions (evaluated on only one CPU core). A disadvantage of the decision trees, on the other hand, is the increased memory consumption: in the case of the LeNet-5 model, the TI2 and TI3 patterns require more parameters than the two convolution layers.
Table 2 shows the comparison between the TI1 pattern and the convolutions using the ResNet-34 model. As can be seen, both achieve classification accuracies above 90%. The TI1 pattern performs slightly better than the convolutions (an improvement of 1.08%). Comparing the runtimes (Table 4) of both approaches, it can be seen that the decision trees are significantly faster to compute (6.77 ms vs. 18.1 ms). The memory consumption for one distribution of the TI1 pattern is 16 floats, and for one convolution window 9 floats. This means that the parameters of the model are almost doubled, while the runtime is roughly a third.
Table 3 also lists LAB (8-stack) with an NME of 3.42 on the common set, 6.98 on the challenging set, and 4.12 on the full set.
Table 3 shows the results for landmark regression using the ResNet-16 model. As can be seen, the convolutions and the decision trees achieve nearly the same result. We used the same pattern (TI1) as for the CIFAR10 classification, which means that the memory consumption of the decision trees is nearly twice as high as for the convolutions. The runtime, in contrast, is halved (4.64 ms vs. 10.76 ms).
| Model | Hardware | Runtime |
| --- | --- | --- |
| LeNet-5 (TI2) | 1 CPU core | 0.18 ms |
| LeNet-5 (TI3) | 1 CPU core | 0.36 ms |
| LeNet-5 (Conv.) | 1 CPU core | 0.83 ms |
| ResNet-34 (TI1) | GPU 1050Ti | 6.77 ms |
| ResNet-34 (Conv.) | GPU 1050Ti | 18.1 ms |
| ResNet-16 (TI1) | GPU 1050Ti | 4.64 ms |
| ResNet-16 (Conv.) | GPU 1050Ti | 10.76 ms |
Table 4 shows an overview of the runtimes of all evaluated models using convolutions and decision trees. All runtime evaluations were performed on a single CPU core (Intel i5-4570) or an NVIDIA 1050Ti GPU to ensure reproducibility and to simplify the comparison with other hardware environments.
Conclusions and Discussions
We presented a novel approach for training decision trees in neural network architectures using the backpropagation algorithm and showed that it is possible to achieve the same results as with convolutions. Classification and regression experiments were conducted on publicly available datasets. The improved runtime of the decision trees was derived theoretically and shown empirically for different models against the high-performance CUDNN implementation from NVIDIA. From an industrial point of view, reducing the runtime while maintaining or even improving the predictive quality is a desirable improvement. In contrast to the runtime, the increased memory consumption is a disadvantage. Further research should investigate the use of indexing sets with different depths and the reduction of the decision trees to only the necessary paths; here the authors see further opportunities for reducing the computation time and the memory consumption. In addition, the decision trees could be extended to use only binary weights, as is done in binary convolutional neural networks, which would further reduce runtime and memory consumption.
Work of the authors is supported by the Institutional Strategy of the University of Tübingen (Deutsche Forschungsgemeinschaft, ZUK 63).
- (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- (2017) Deep convolutional decision jungle for image classification. arXiv preprint arXiv:1706.02003.
- (2013) Localizing parts of faces using a consensus of exemplars. Pattern Analysis and Machine Intelligence 35 (12), pp. 2930–2940.
- (2018) Conditional information gain networks. In International Conference on Pattern Recognition, pp. 1390–1395.
- (2007) Image classification using random forests and ferns. In International Conference on Computer Vision, pp. 1–8.
- (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
- (2018) Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
- (2012) Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745.
- (2002) Torch: a modular machine learning software library. Technical report, Idiap.
- (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
- (2014) Deep sequential neural network. arXiv preprint arXiv:1410.0510.
- (2018) Style aggregated network for facial landmark detection. In Conference on Computer Vision and Pattern Recognition.
- (2018) Wing loss for robust facial landmark localisation with convolutional neural networks. In Conference on Computer Vision and Pattern Recognition.
- (2014) Fractional max-pooling. CoRR abs/1412.6071.
- (2017) Phase-functioned neural networks for character control. Transactions on Graphics 36 (4), pp. 42.
- (2016) Decision forests, convolutional networks and the models in-between. arXiv preprint arXiv:1603.01250.
- (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
- (2016) Dynamic filter networks. In Advances in Neural Information Processing Systems, pp. 667–675.
- (2018) SplineNets: continuous neural decision graphs. In Advances in Neural Information Processing Systems, pp. 1994–2004.
- (2009) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10, pp. 1755–1758.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2015) Deep neural decision forests. In International Conference on Computer Vision, pp. 1467–1475.
- (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
- (2012) Interactive facial feature localization. In European Conference on Computer Vision, pp. 679–692.
- (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- (1999) XM2VTSDB: the extended M2VTS database. In Audio and Video-based Biometric Person Authentication, Vol. 964, pp. 965–966.
- (2015) All you need is a good init. arXiv preprint arXiv:1511.06422.
- (2017) Additive component analysis. In Conference on Computer Vision and Pattern Recognition, pp. 2491–2499.
- (2016) Face alignment via regressing local binary features. Image Processing 25 (3), pp. 1233–1245.
- (2014) Neural decision forests for semantic image labelling. In Computer Vision and Pattern Recognition, pp. 81–88.
- (2013) A semi-automatic methodology for facial landmark annotation. In Conference on Computer Vision and Pattern Recognition Workshops, pp. 896–903.
- (2015) APAC: augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229.
- (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. CoRR abs/1701.06538.
- (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
- (2015) Going deeper with convolutions. In Computer Vision and Pattern Recognition, pp. 1–9.
- (2013) Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pp. 1058–1066.
- (2017) Using a random forest to inspire a neural network and improving on it. In International Conference on Data Mining, pp. 1–9.
- (2018) Look at boundary: a boundary-aware face alignment algorithm. In Computer Vision and Pattern Recognition.
- (2015) Conditional convolutional neural network for modality-aware face recognition. In International Conference on Computer Vision, pp. 3667–3675.
- (2013) Supervised descent method and its applications to face alignment. In Computer Vision and Pattern Recognition, pp. 532–539.
- (2012) Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition, pp. 2879–2886.
- (2016) Face alignment across large poses: a 3D solution. In Conference on Computer Vision and Pattern Recognition, pp. 146–155.