Rectified Decision Trees: Towards Interpretability, Compression and Empirical Soundness

Jiawang Bai*, Yiming Li, Jiawei Li, Yong Jiang and Shutao Xia
Graduate School at Shenzhen, Tsinghua University, China
Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, China
{li-ym18, li-jw15}, {jiangy, xiast}
* Equal contribution.

How to obtain a model with both good interpretability and good performance has always been an important research topic. In this paper, we propose rectified decision trees (ReDT), a knowledge distillation based rectification of decision trees with high interpretability, small model size, and empirical soundness. Specifically, we extend the impurity calculation and the pure ending condition of the classical decision tree to obtain a decision tree extension that allows the use of soft labels generated by a well-trained teacher model in the training and prediction process. For the acquisition of soft labels, we propose a new multiple cross-validation based method to reduce the effects of randomness and overfitting. These approaches ensure that ReDT retains excellent interpretability and even has fewer nodes than the decision tree, while achieving relatively good performance. Besides, in contrast to traditional knowledge distillation, back-propagation of the student model is not necessarily required in ReDT, which makes it an attempt at a new knowledge distillation approach. Extensive experiments demonstrate the superiority of ReDT in interpretability, compression, and empirical soundness.

1 Introduction

Random forests are a typical ensemble learning method, in which a large number of randomized decision trees are constructed and the results from all trees are combined for the final prediction of the forest. Since their introduction in [?], random forests and their variants [??] have been widely used in many fields, such as deep learning [??] and even outlier detection [?]. In addition to their applications, their theoretical properties have also been extensively studied [??].

However, although complicated algorithms such as random forests and GBDT achieve great success in many respects, their high prediction performance comes at a considerable cost in interpretability. The essential procedures of ensemble approaches cause this decline. For example, compared with decision trees, the bootstrap and voting processes of random forests make the predictions much more difficult to explain. In contrast, decision trees are known to have the best interpretability among machine learning algorithms, yet their performance is relatively poor. Besides, forest-based algorithms and deep neural networks (DNN) usually require much more storage than decision trees, which is unacceptable when the model is deployed on a personal device with strict storage limitations (such as a mobile phone). This conflict between empirical soundness on the one hand and interpretability and small storage on the other continuously drives researchers.

To address these problems, Breiman [?] proposed using a tree ensemble to generate additional samples for the subsequent construction of a decision tree, which can be regarded as the first attempt at this problem. In [?], Node Harvest is introduced to simplify a tree ensemble by using the shallow parts of its trees. The shortcoming of Node Harvest is that the simplified model is still an ensemble, and therefore the challenge of interpretation remains. Recently, a distillation-based method was proposed, in which softened labels generated by a well-trained DNN are used to create a more understandable model in the form of a soft decision tree [?]. However, since this method relies on back-propagation through the soft decision tree, it cannot be applied to classical decision trees. Besides, the interpretability of soft decision trees [?] is much weaker than that of classical decision trees.

In this paper, we propose rectified decision trees (ReDT), a knowledge distillation based rectification of decision trees with high interpretability, empirical soundness, and an even smaller model size than decision trees. The critical difference between ReDT and the decision tree lies in the use of soften labels, which are the weighted average of soft labels (the output probability vector of a well-trained teacher model) and hard labels, in the process of building trees. Specifically, when constructing a decision tree, the hard label is mainly involved in two parts of the construction process: (1) calculating the change of impurity and (2) determining whether the node is pure in the stopping condition. In our method, we introduce soften labels into both. Firstly, we calculate the average of the soften labels in the node. The proportion of samples of the $k$-th category, which is needed in the impurity criterion, is re-determined as the value of the $k$-th dimension of this average. Secondly, since it is almost impossible for the mixed labels of all samples in a node to be the same, we propose to use a pseudo-category derived from the soften label of each sample, so that the original stopping condition can be retained. In ReDT, the teacher model can be a DNN or any other classification algorithm, and therefore ReDT is universal. In contrast to traditional knowledge distillation, back-propagation of the student model is not necessarily required in ReDT, which can be regarded as an attempt at a new knowledge distillation approach. Besides, we propose a new multiple cross-validation based method to reduce the effects of randomness and overfitting.

The main contributions of this paper can be stated as follows: 1) we propose a decision tree extension, which is the first tree that allows training and prediction using soften labels; 2) we propose the first universal back-propagation-free distillation framework and 3) conduct an empirical analysis of its mechanism; 4) we propose a new soften label acquisition method based on multiple cross-validation to reduce the effects of randomness and overfitting; 5) extensive experiments demonstrate the superiority of our approach in interpretability, compression, and empirical soundness.

2 Related Work

The interpretability of complex machine learning models, especially ensemble approaches and deep learning, has received wide attention. At present, the most widely used machine learning models are mainly forest-based algorithms and DNNs, so their interpretability is of great significance. There are a few previous studies on the interpretability of forest-based algorithms. The first work is by Breiman [?], who proposes to use a tree ensemble to generate additional samples for the subsequent construction of a single decision tree. In [?], Node Harvest is proposed to simplify tree ensembles by using the shallow parts of the trees. Treating the simplification of tree ensembles as a model selection problem and using a Bayesian method for the selection is proposed in [?]. Research on the interpretability of DNNs focuses mainly on three aspects: visualizing the representations in intermediate layers of DNNs [??], representation diagnosis [??] and building explainable DNNs [??]. Recently, a knowledge distillation based method was proposed, which uses a trained DNN to create a more explainable model in the form of a soft decision tree [?].

The compression of forest-based algorithms and DNNs has also received extensive attention. A series of works focuses on pruning techniques for forest-based algorithms, whose idea is to reduce the size by removing redundant components while maintaining the predictive performance [???]. The idea of pruning is also widely used in the compression of DNNs [??]. Recently, extensive research has been conducted on compression methods based on coding or quantization [??].

Recently, knowledge distillation has been widely accepted as a compression method. The concept of knowledge distillation in the teacher-student framework, which introduces the teacher's softened output, is first proposed in [?]. Since then, a series of improvements and applications of knowledge distillation have been proposed [??]. At present, almost all knowledge distillation methods focus on the compression of DNNs and require back-propagation of the student model. Besides, using knowledge distillation to distill a DNN into a soft decision tree to achieve good interpretability and compressibility is recently proposed in [?]. This method can be regarded as the first attempt to apply knowledge distillation to interpretability.

3 The Proposed Method

We present the rectified decision trees (ReDT) in this section. The main questions of our proposed method are how to define the important information of the teacher model (the distilled knowledge) and how to use it in training the student model (i.e., the ReDT). Section 3.1 introduces the distilled knowledge that is used in the construction of ReDT. Sections 3.2 and 3.3 discuss the construction and prediction processes of ReDT. An empirical analysis, which demonstrates why soften labels can reach better performance than hard labels in the construction of the decision tree, is provided in Section 3.4.

3.1 Distilled Knowledge

Let $\mathcal{D}$ denote a data set consisting of $n$ observations. Each observation has the form $(\boldsymbol{x}, y)$, where $\boldsymbol{x} \in \mathbb{R}^{D}$ represents the $D$-dimensional features and $y \in \{1, \dots, K\}$ is the corresponding label of the observation. The label of a sample can be regarded as a single draw from a $K$-dimensional discrete distribution. Let the hard label $\boldsymbol{y}^{hard}$ denote the one-hot representation of the label (a $K$-dimensional vector whose entry in the dimension corresponding to the category is 1 and whose other entries are all 0).

From the perspective of probability, the training of a model can be considered as an approximation of the distribution of the data. It is extremely difficult to recover the true distribution of the data from the hard labels directly. In contrast, the output of a well-trained model contains a significant amount of useful information compared to the original hard label itself. Inspired by this idea, we define the soft label $\boldsymbol{y}^{soft}$, which is the output probability vector of a well-trained model such as a DNN, random forests or GBDT, as the distilled knowledge from the teacher model. This idea is also partly supported by [?], which uses a softened version of the final output of a teacher network to teach information to a small student network.

Once a well-trained teacher model is given, generating the soft labels is straightforward: the model directly outputs the class probabilities of all training samples. However, the teacher model usually has to be obtained through training. The most straightforward idea is to train the teacher model using all training samples and then output the soft labels of those same samples. However, the soft labels obtained through this process have relatively poor quality due to the effects of randomness and overfitting. This problem does not exist in previous knowledge distillation tasks, since their training is carried out simultaneously rather than strictly one after the other, because both the teacher model and the student model can be trained through back-propagation. To address this problem, we propose a multiple cross-validation based method to calculate soft labels. Specifically, if 5 times 5-fold cross-validation is used, in each repetition we first randomly divide the training set into five similarly sized sets, and then, for each set in turn, use the other four sets for training and that set for prediction (i.e., to output its soft labels). In each repetition, each sample is predicted exactly once, so that each sample ends up with 5 soft labels. The final soft label is the average of all its predictions.
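The repeated cross-validation procedure described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `cross_validated_soft_labels`, the `make_teacher` factory and the toy `PriorTeacher` class are hypothetical, labels are assumed to be integers in 0..K-1, and the teacher is assumed to expose `fit` and `predict_proba` methods returning one probability column per class in label order.

```python
import numpy as np

def cross_validated_soft_labels(X, y, make_teacher, n_classes,
                                n_repeats=5, n_folds=5, seed=0):
    """Average out-of-fold teacher probabilities over repeated k-fold CV."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    rng = np.random.default_rng(seed)
    soft_sum = np.zeros((n, n_classes))
    for _ in range(n_repeats):
        # Randomly divide the training set into similarly sized folds.
        folds = np.array_split(rng.permutation(n), n_folds)
        for k, held_out in enumerate(folds):
            train = np.concatenate([f for j, f in enumerate(folds) if j != k])
            teacher = make_teacher()          # fresh teacher for every fold
            teacher.fit(X[train], y[train])
            # Each sample is predicted exactly once per repetition.
            soft_sum[held_out] += teacher.predict_proba(X[held_out])
    return soft_sum / n_repeats               # final soft label = mean prediction

class PriorTeacher:
    """Toy stand-in teacher (hypothetical): predicts the training class prior."""
    def __init__(self, n_classes):
        self.n_classes = n_classes
    def fit(self, X, y):
        counts = np.bincount(y, minlength=self.n_classes)
        self.prior_ = counts / counts.sum()
        return self
    def predict_proba(self, X):
        return np.tile(self.prior_, (len(X), 1))

X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([0, 1] * 10)
soft_demo = cross_validated_soft_labels(X_demo, y_demo,
                                        lambda: PriorTeacher(2), n_classes=2)
```

Any classifier following the same `fit`/`predict_proba` protocol could be passed through `make_teacher` in place of the toy teacher.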

3.2 ReDT Construction

In the proposed ReDT, there are two main alterations compared with the original decision tree: the calculation of the impurity decrease and the stopping condition. We introduce the soften label into both.

Note that instead of using the soft label of a sample directly, we use the mixed label $\boldsymbol{y}^{mixed}$, which is the weighted average of the soft label and the hard label with weight hyperparameter $\alpha \in [0, 1]$. That is,

$$\boldsymbol{y}^{mixed} = \alpha\, \boldsymbol{y}^{hard} + (1 - \alpha)\, \boldsymbol{y}^{soft}. \tag{1}$$

The hyperparameter $\alpha$ regulates the proportion of the soft label. The larger $\alpha$, the smaller the proportion of the soft label in the mixed label. When $\alpha = 1$, the ReDT becomes Breiman's decision tree. Mixed labels are used because the soft label may contain a certain degree of error. By adjusting the hyperparameter $\alpha$, we can obtain a soften label with sufficient information and relative accuracy.
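As a small illustration of the mixing step, the sketch below forms the weighted average of one-hot hard labels and teacher soft labels. The function name `mixed_labels`, the symbol `alpha` for the weight and the integer label encoding 0..K-1 are assumptions made for this example.

```python
import numpy as np

def mixed_labels(hard_labels, soft_labels, alpha, n_classes):
    """Return alpha * one_hot(hard) + (1 - alpha) * soft, one row per sample."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    one_hot = np.eye(n_classes)[np.asarray(hard_labels)]
    return alpha * one_hot + (1.0 - alpha) * np.asarray(soft_labels, dtype=float)

hard = [0, 1]
soft = [[0.7, 0.2, 0.1],
        [0.3, 0.6, 0.1]]
mixed = mixed_labels(hard, soft, alpha=0.5, n_classes=3)
# With alpha = 1 the soft label is ignored and the one-hot hard label is returned.
only_hard = mixed_labels(hard, soft, alpha=1.0, n_classes=3)
```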

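Recall that in a classification problem, the impurity decrease caused by a splitting point $v$ is denoted by

$$I(v) = T(\mathcal{D}) - \frac{|\mathcal{D}^{l}|}{|\mathcal{D}|}\, T(\mathcal{D}^{l}) - \frac{|\mathcal{D}^{r}|}{|\mathcal{D}|}\, T(\mathcal{D}^{r}), \tag{2}$$

where $\mathcal{D}^{l}, \mathcal{D}^{r}$ are the two child sets generated by splitting $\mathcal{D}$ at $v$, and $T(\cdot)$ is the impurity criterion (e.g., Shannon entropy or Gini index). The first alteration in ReDT concerns the probability $p_k$, i.e., the proportion of samples of the $k$-th category, used in calculating the impurity decrease of a splitting point. Specifically, since each sample carries a soften label instead of a hard label, we calculate the average of the soften labels of all samples in the node and obtain a $K$-dimensional vector. $p_k$ is then redefined as the value of the $k$-th dimension of that vector. In other words, let $\boldsymbol{y}^{mixed}_{i}$ denote the mixed label of the $i$-th training sample. Then $p_k$ of node $\mathcal{N}$ is calculated by

$$p_k = \frac{1}{|\mathcal{N}|} \sum_{i:\, \boldsymbol{x}_i \in \mathcal{N}} y^{mixed}_{i,k}, \tag{3}$$

where $|\mathcal{N}|$ denotes the number of samples in node $\mathcal{N}$.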
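To make the modified impurity computation concrete, the sketch below evaluates the impurity decrease of one candidate split using the Gini index, with class proportions taken as the mean of the mixed labels in each node. The function names and the boolean `left_mask` encoding of a split are illustrative assumptions.

```python
import numpy as np

def gini_from_mixed(mixed):
    """Gini index with p_k = k-th entry of the mean mixed label in the node."""
    p = np.asarray(mixed, dtype=float).mean(axis=0)
    return 1.0 - float(np.sum(p ** 2))

def impurity_decrease(mixed, left_mask):
    """Parent impurity minus the size-weighted impurities of the two children."""
    mixed = np.asarray(mixed, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    n, n_left = len(mixed), int(left_mask.sum())
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0                        # degenerate split: nothing is separated
    return (gini_from_mixed(mixed)
            - n_left / n * gini_from_mixed(mixed[left_mask])
            - n_right / n * gini_from_mixed(mixed[~left_mask]))

node_mixed = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
decrease = impurity_decrease(node_mixed, [True, True, False, False])
```

In this toy node the parent proportions are (0.5, 0.5) and each child has proportions (0.85, 0.15) up to ordering, so the decrease is 0.5 - 0.255 = 0.245.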

The second alteration is how purity is defined in the stopping condition. In the training process of the original decision tree, a node is considered pure if all samples in it belong to a single category. At this point, the stopping condition is reached, and the node is no longer split. However, in the ReDT, it is almost impossible for the mixed labels of all samples to be the same. Therefore, we use the category corresponding to the maximum probability in the mixed label of a sample as its pseudo-category $y^{pseudo}$, i.e.,

$$y^{pseudo} = \arg\max_{k}\; y^{mixed}_{k}, \tag{4}$$

and then determine whether to continue splitting by applying the original stopping condition to the pseudo-categories.

Algorithm 1 The training process of ReDT
Input: Training set $\mathcal{D}$ with mixed labels calculated according to (1), and minimum leaf size $m$.
Output: The rectified decision tree.
Calculate the pseudo-category of each sample in $\mathcal{D}$ by (4).
Determine whether the node is pure based on whether all samples in $\mathcal{D}$ have the same pseudo-category.
if $|\mathcal{D}| > m$ and the node is not pure then
   Calculate the impurity decrease vector according to equation (2).
   Select the splitting point with the maximum impurity decrease.
   Split the training set correspondingly into two child nodes, called $\mathcal{D}^{l}$ and $\mathcal{D}^{r}$.
   Recursively apply the procedure to $\mathcal{D}^{l}$ and $\mathcal{D}^{r}$.
end if
Return: The rectified decision tree.
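The training procedure can be sketched in Python as below. This is a hedged, minimal reading of Algorithm 1 rather than the authors' code: the dictionary-based node layout, the use of midpoints between sorted feature values as candidate thresholds, the Gini criterion and the reading of the stopping test as "more samples than the minimum leaf size" are all assumptions of this example.

```python
import numpy as np

def gini(mixed):
    p = mixed.mean(axis=0)                 # class proportions from mixed labels
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, mixed, min_leaf):
    """Return (feature, threshold, decrease) of the best split, or None."""
    n, d = X.shape
    parent, best = gini(mixed), None
    for j in range(d):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:   # midpoints as thresholds
            left = X[:, j] <= t
            n_left = int(left.sum())
            if n_left < min_leaf or n - n_left < min_leaf:
                continue
            dec = (parent - n_left / n * gini(mixed[left])
                   - (n - n_left) / n * gini(mixed[~left]))
            if best is None or dec > best[2]:
                best = (j, float(t), dec)
    return best

def build_redt(X, mixed, min_leaf=1):
    """Recursively grow a tree; every node stores its mean mixed label."""
    X, mixed = np.asarray(X, dtype=float), np.asarray(mixed, dtype=float)
    pseudo = mixed.argmax(axis=1)          # pseudo-category of each sample
    node = {"dist": mixed.mean(axis=0)}
    pure = bool(np.all(pseudo == pseudo[0]))
    if len(X) > min_leaf and not pure:
        split = best_split(X, mixed, min_leaf)
        if split is not None:
            j, t, _ = split
            left = X[:, j] <= t
            node.update(feature=j, threshold=t,
                        left=build_redt(X[left], mixed[left], min_leaf),
                        right=build_redt(X[~left], mixed[~left], min_leaf))
    return node

X_toy = [[0.0], [1.0], [2.0], [3.0]]
mixed_toy = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
tree_toy = build_redt(X_toy, mixed_toy, min_leaf=1)
```

On the toy data the root is split at threshold 1.5, after which both children are pure with respect to the pseudo-category and become leaves.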

3.3 Prediction

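Once the ReDT has been grown based on the mixed labels as described above, the prediction for a newly given sample is made as follows.

Suppose the unlabeled sample is $\boldsymbol{x}$, and its predicted label and predicted discrete probability distribution are $\hat{y}$ and $\hat{\boldsymbol{p}}$ respectively.

Following a series of decisions, $\boldsymbol{x}$ eventually falls into a leaf node, say $\mathcal{N}$. The predicted distribution of $\boldsymbol{x}$ is the average of the mixed labels of all training samples falling into leaf node $\mathcal{N}$, i.e.,

$$\hat{\boldsymbol{p}} = \frac{1}{|\mathcal{N}|} \sum_{i:\, \boldsymbol{x}_i \in \mathcal{N}} \boldsymbol{y}^{mixed}_{i}, \tag{5}$$

where $|\mathcal{N}|$ denotes the number of samples in leaf node $\mathcal{N}$.

The predicted label of $\boldsymbol{x}$ is the category with the largest probability in $\hat{\boldsymbol{p}}$:

$$\hat{y} = \arg\max_{k}\; \hat{p}_{k}. \tag{6}$$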
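The prediction rule can be illustrated with a small hand-built tree. The dictionary node layout (keys `feature`, `threshold`, `left`, `right`, `mixed_labels`) and the function name `predict` are hypothetical choices for this sketch; leaves simply store the mixed labels of the training samples that reached them.

```python
import numpy as np

def predict(node, x):
    """Descend to a leaf, average its mixed labels, and take the arg max."""
    while "feature" in node:                       # internal node: keep descending
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    dist = np.mean(node["mixed_labels"], axis=0)   # predicted distribution
    return int(np.argmax(dist)), dist              # predicted label, distribution

toy_tree = {
    "feature": 0, "threshold": 1.5,
    "left": {"mixed_labels": [[0.9, 0.1], [0.8, 0.2]]},
    "right": {"mixed_labels": [[0.1, 0.9], [0.2, 0.8]]},
}
label_a, dist_a = predict(toy_tree, [0.3])
label_b, dist_b = predict(toy_tree, [2.0])
```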


3.4 Empirical Analysis

Why soften labels rather than hard labels should be used can be further explained from two perspectives: the calculation of impurity and the approximation of the distribution. The specific analyses are as follows.

Lemma 1 (Integer Partition Lemma).

Suppose there is a non-negative integer $N$, which is the sum of $K$ non-negative integers $n_1, \dots, n_K$, i.e.,

$$N = \sum_{k=1}^{K} n_k.$$

There are in total $\binom{N+K-1}{K-1}$ possible values for the ordered tuple $(n_1, \dots, n_K)$.

Proof. This problem is equivalent to picking $K-1$ locations from $N+K-1$ locations. The result then follows from elementary combinatorics. ∎

Lemma 1 indicates that for a $K$-class classification problem, if a node contains $N$ samples, then the impurity of this node computed from hard labels has at most $\binom{N+K-1}{K-1}$ possible values. In other words, compared with the soften label, the use of the hard label limits the precision of the node impurity. This limitation has a strongly adverse effect on the selection of the split point, especially when the number of samples is relatively small.
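The counting argument can be checked numerically for small cases. The sketch below compares the closed-form count with a brute-force enumeration and also counts how many distinct Gini values hard labels can actually produce; the function names are illustrative, and the symbols N (samples in the node) and K (classes) follow the lemma as read here.

```python
from itertools import product
from math import comb

def count_by_formula(n_samples, n_classes):
    """Stars-and-bars count of ordered class-count vectors summing to n_samples."""
    return comb(n_samples + n_classes - 1, n_classes - 1)

def class_count_vectors(n_samples, n_classes):
    """Enumerate all ordered vectors of non-negative counts with the right sum."""
    return [c for c in product(range(n_samples + 1), repeat=n_classes)
            if sum(c) == n_samples]

def distinct_gini_values(n_samples, n_classes):
    """Number of different Gini values reachable with hard labels only."""
    values = {round(1.0 - sum((c / n_samples) ** 2 for c in counts), 12)
              for counts in class_count_vectors(n_samples, n_classes)}
    return len(values)

n_vectors = len(class_count_vectors(4, 2))   # (0,4), (1,3), (2,2), (3,1), (4,0)
n_gini = distinct_gini_values(4, 2)          # Gini values 0, 0.375 and 0.5
```

With four samples and two classes a hard-label node can therefore take only three distinct Gini values, whereas averaged mixed labels can take any value in between.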

From another perspective, the improvement brought by soft labels arises because it is difficult to recover the distribution of the data from hard labels directly, especially when the number of samples is relatively small. However, once relatively correct soften labels are provided by a well-trained teacher model, they contain a large amount of information about the distribution. Using this information moves the decision surface towards its real position compared with using the hard label.

3.5 Comparison between DT, SDT and ReDT

Although both soft decision trees (SDT) and ReDT are extensions of DT, there are many differences between them. In this section, we compare DT, SDT and ReDT in five aspects: (1) interpretability, (2) empirical soundness, (3) back-propagation needed, (4) soften label allowed and (5) compression, as shown in Table 1. A method that satisfies an aspect is marked by ✓.

It is worth noting that interpretability, empirical soundness, and compression are relative. For example, the interpretability of SDT is stronger than that of DNN but much weaker than that of DT and ReDT. Besides, since back-propagation of the student model is not necessarily required in ReDT, this new knowledge distillation approach can be easily extended to other models while preserving running efficiency.

Empirical soundness
Back-propagation needed
Soften label allowed
Compression (small model size)
Table 1: Comparison between DT, SDT and ReDT.

4 Experiments

4.1 Configuration

For the DNN teachers, the experiments were conducted on the benchmark dataset MNIST [?]. All networks were trained using Adam with an initial learning rate of 0.1. The learning rate was divided by 10 after epochs 20 and 40 (50 epochs in total). We examine a variety of DNN architectures, including MLP, LeNet-5, and VGG-11, and use ReLU as the activation function and cross-entropy as the loss function. The MLP has two hidden layers, with 784 and 256 units respectively, and a dropout rate of 0.5 for the hidden layers. Besides, the temperature used to generate soft labels from the DNNs is set to 4, as suggested in [?].
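The role of the temperature can be illustrated with a temperature-scaled softmax, the usual way soft labels are produced from network logits in knowledge distillation. The sketch below is an assumption about how the temperature of 4 would be applied, not the authors' code; the function name and the example logits are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=4.0):
    """Softmax of logits / temperature; larger temperatures give softer labels."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

example_logits = [2.0, 1.0, 0.0]
soft_t4 = softmax_with_temperature(example_logits, temperature=4.0)
soft_t1 = softmax_with_temperature(example_logits, temperature=1.0)
```

Both outputs rank the classes identically, but the one computed with temperature 4 is flatter and therefore exposes more of the teacher's relative preferences among the non-maximal classes.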

Data set Class Features Instances
adult 2 14 48842
crx 2 15 690
EEG 2 15 14980
bank 2 17 45211
german 2 20 1000
cmc 3 9 1473
connect-4 3 42 67557
land-cover 9 147 675
letter 26 15 20000
isolet 26 617 7797
Table 2: Datasets description.

All data sets involved in the evaluation of forest-based teachers are obtained from the UCI repository [?]. Their information is listed in Table 2. A portion of each data set is used for training and the remainder for testing. Here we use random forests (RF) and GBDT as the teacher models. They are representative of the bagging and boosting methods among forest-based teachers, respectively. We determine the value of $\alpha$ by grid search with a step of 0.1 in the range $[0, 1]$, and the implementation of GBDT is based on the scikit-learn platform [?]. The number of trees in both the random forest and GBDT is set to 100. Besides, the performance of the decision tree trained with hard labels is provided as a benchmark. Since the soft decision tree is more like a tree-shaped neural network and has much weaker interpretability than the classical decision tree, it is not included as a benchmark in the experiments. The Gini index is used in RF, DT and ReDT as the impurity measure, and the same minimum leaf size is set for RF, GBDT, DT, and ReDT, as suggested in [?].

Besides, 5 times 5-fold cross-validation is used to calculate the soft labels of the training set, and Wilcoxon's signed-rank test [?] is carried out to test for differences between the results of ReDT and those of decision trees at significance level 0.05. Where ReDT performs better than the decision tree (higher accuracy or fewer nodes), the result is indicated in boldface. Results with a statistically significant difference from the decision tree are marked in the tables. Besides, each experiment is repeated 10 times to reduce the effect of randomness.

Data set RF GBDT DT ReDT(RF) ReDT(GBDT) α(RF) α(GBDT)
adult 86.54% 86.53% 81.86% 86.18% 86.16% 0.01 0.06
crx 86.14% 86.09% 80.51% 85.46% 84.40% 0.08 0.11
EEG 81.50% 90.58% 82.88% 83.02% 83.01% 0.24 0.52
bank 90.38% 90.41% 87.60% 90.11% 90.15% 0.06 0.03
german 76.60% 76.13% 68.37% 73.40% 72.67% 0.07 0.10
cmc 55.15% 55.66% 48.31% 55.05% 55.41% 0 0
connect-4 75.38% 77.58% 71.73% 76.69% 76.02% 0.30 0.30
land-cover 83.69% 83.80% 76.55% 77.59% 77.14% 0.54 0.37
letter 91.56% 93.61% 85.65% 86.01% 86.15% 0.9 0.9
isolet 93.68% 93.32% 79.83% 81.40% 81.77% 0.57 0.33
  • : ReDT is better than decision trees at a level of significance 0.05.
  • α: The average of the best α over all experiments.
Table 3: Comparison of test accuracy with different forest-based teacher models.

Data set RF GBDT DT ReDT(RF) ReDT(GBDT)
adult 244832 1486 7869 2286 2023
crx 6191 1336 103 48 65
EEG 125289 1489 1858 1948 1939
bank 223906 1470 4302 1678 1603
german 11063 1404 227 140 172
cmc 16220 4206 630 202 275
connect-4 470261 4426 18152 8813 8740
land-cover 6100 9492 85 43 49
letter 168101 38631 2752 2464 2459
isolet 58905 32751 707 464 593
  • : ReDT is better than decision trees at a level of significance 0.05.
Table 4: Comparison of the number of nodes with forest-based teacher distillation.

4.2 DNNs Teacher

In this section, we discuss the performance of ReDT, including test accuracy (ACC) and the number of nodes (NODE), under different teacher models and compare it with DT and with its teacher model.

Teacher MLP LeNet-5 VGG-11
ACC (DNN) 98.33% 99.42% 99.49%
ACC (DT) 87.55%
NODE (DT) 5957
ACC (ReDT) 88.21% 88.57% 88.53%
NODE (ReDT) 5361 5173 5803
  • : ReDT is better than DT at a level of significance 0.05.
Table 5: Comparison on MNIST.

As shown in Table 5, although there is still a gap in ACC between ReDT and its teacher model, since a decision tree cannot learn the spatial relationships among the raw pixels, ReDT achieves a clear improvement over the original decision tree. Moreover, in terms of compression, ReDT even has fewer nodes than DT (and therefore a smaller model size).

4.3 Forests-based Teacher

Tables 3 and 4 show the test accuracy and the number of nodes, respectively, for different forest-based teacher models. Regardless of which teacher model is used, ReDT shows a remarkable improvement in both performance and compression. On all ten data sets, ReDT has higher test accuracy than the decision tree, and this improvement is significant on seven of them. Specifically, ReDT achieves an increase of almost 5% in accuracy compared to DT on half of the data sets. In particular, on three data sets (BANK, ADULT, and CONNECT-4), ReDT performs similarly to its teacher model. Also, the optimal value of $\alpha$ seems to be directly connected with the number of categories in the data set. Specifically, data sets with more categories (such as LAND-COVER, ISOLET, and LETTER) generally have a larger optimal $\alpha$. In other words, a large proportion of the hard label needs to be included in the mixed label for ReDT to perform well. Two reasons may cause this: 1) the more categories, the more likely the soft labels are to contain erroneous information; 2) the more categories, the stronger the interference caused by erroneous information contained in the soft labels. Regardless of the reason, the number of categories can be used to provide an initial intuition for $\alpha$. In terms of compression, on nine of the ten data sets, the number of nodes in ReDT is smaller than in the decision tree. In other words, ReDT has a smaller model size than the decision tree, not to mention the teacher models, such as random forests and GBDT, which are usually more complicated. Overall, ReDT with a forest-based teacher achieves a significant improvement in both performance and compression.

Figure 1: Visualization of key pixels for MNIST image classification. (The key pixels are marked in red.)

4.4 Discussion

In this section, we discuss the compression, the interpretability, and the impact of the hyperparameter on the model.


As mentioned above, ReDT is an extension of the decision tree, and therefore the size of its model can still be measured by the number of nodes. Without loss of generality, we compare ReDT with the decision tree here. There are two advantages to such a comparison: 1) the decision tree requires nearly the fewest nodes among forest-based algorithms, and its size is much smaller than that of a DNN or other complex algorithms, so a model that is even smaller than the decision tree must have excellent compression; 2) the sizes of the decision tree and of ReDT are both reflected by the number of nodes, which is convenient for comparison.

Without loss of generality, we use random forests and GBDT as the teacher models here. The compression rate on multiple data sets under different values of the hyperparameter $\alpha$ is shown in Fig. 2. The smaller the compression rate, the smaller the model size of ReDT.

It can be seen that the compression rate on almost every data set is less than 1, which indicates that, for every value of the hyperparameter $\alpha$, ReDT has a smaller model size than DT in most cases. In addition, as $\alpha$ increases, the compression rate shows an upward trend. This is because the soft label carries a large amount of information about the distribution, whether correct or not, which makes it easier for the decision tree to divide the data. The smaller $\alpha$ is, the larger the proportion of the soft label in the mixed label, and therefore the smaller the model. Thus, although no single $\alpha$ yields the highest test accuracy on all data sets (because this depends on complex factors such as the correctness of the soft label and the data set), using $\alpha$ to adjust the size of the model is a good choice. Besides, as shown in the figure, the compression rate grows significantly more slowly on data sets with more categories. Regardless of the reason, this opposite tendency between compression rate and accuracy (the more categories, the larger the optimal $\alpha$) allows ReDT to have a smaller model size while achieving empirical soundness.
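As a worked illustration, the sketch below computes a compression rate from the node counts in Table 4. Two assumptions are made here and are not stated explicitly in the text: the compression rate is taken to be the number of ReDT nodes divided by the number of DT nodes, and the last three numeric columns of Table 4 are read as DT, ReDT with the RF teacher, and ReDT with the GBDT teacher.

```python
# Node counts copied from Table 4 as (DT, ReDT with RF teacher, ReDT with GBDT
# teacher); this reading of the columns is an assumption.
NODES = {
    "adult": (7869, 2286, 2023),
    "crx": (103, 48, 65),
    "EEG": (1858, 1948, 1939),
    "bank": (4302, 1678, 1603),
    "german": (227, 140, 172),
    "cmc": (630, 202, 275),
    "connect-4": (18152, 8813, 8740),
    "land-cover": (85, 43, 49),
    "letter": (2752, 2464, 2459),
    "isolet": (707, 464, 593),
}

def compression_rate(redt_nodes, dt_nodes):
    """Assumed definition: ReDT node count relative to the DT node count."""
    return redt_nodes / dt_nodes

rates = {name: (compression_rate(rf, dt), compression_rate(gbdt, dt))
         for name, (dt, rf, gbdt) in NODES.items()}

for name, (rate_rf, rate_gbdt) in rates.items():
    print(f"{name:<11} RF teacher: {rate_rf:.3f}   GBDT teacher: {rate_gbdt:.3f}")
```

Under this reading, the rate is below 1 on nine of the ten data sets for either teacher, with EEG as the exception.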

Figure 2: Compression rate under different teacher model. (a) Random forests teacher; (b) GBDT teacher.


The decision tree makes its prediction according to the leaf node to which the input belongs. The corresponding leaf node is determined by traversing the tree from the root. Although this path represents the decision of the model, when the sample is high-dimensional, especially when it is an image or speech, a single category has a large number of different decision paths, and it is therefore difficult to explain the output by simply listing the path. To address this problem, we propose to highlight the key features (pixels) used in the decision path of the sample.

Here, we use MNIST as an example to demonstrate the interpretability of ReDT. We randomly select three samples of each digit for prediction. The pixels tested along the decision path, i.e., the key pixels, are marked in red, as shown in Fig. 1. Although we do not have a ground-truth decision path, the key pixels largely follow the outline of the digit, so the prediction is highly interpretable and trustworthy.
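Collecting the key pixels of one sample amounts to recording every feature tested along its decision path. A minimal sketch is given below, assuming a dictionary-based tree (keys `feature`, `threshold`, `left`, `right`) over flattened pixel indices; the layout, the function name `key_pixel_mask` and the tiny 2x2 "image" are all hypothetical.

```python
import numpy as np

def key_pixel_mask(tree, image):
    """Return a boolean mask of the pixels tested on the sample's decision path."""
    x = np.asarray(image, dtype=float).ravel()      # flatten pixels to features
    mask = np.zeros(x.shape, dtype=bool)
    node = tree
    while "feature" in node:                        # stop when a leaf is reached
        mask[node["feature"]] = True                # this pixel was used to decide
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return mask.reshape(np.shape(image))

toy_tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": 0},
    "right": {"feature": 3, "threshold": 0.5,
              "left": {"label": 1}, "right": {"label": 2}},
}
mask_a = key_pixel_mask(toy_tree, [[1.0, 0.0], [0.0, 1.0]])
mask_b = key_pixel_mask(toy_tree, [[0.0, 0.0], [0.0, 0.0]])
```

For a 28x28 MNIST image the returned mask has the same shape as the image and can be overlaid on it to highlight the key pixels.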

5 Conclusion

Recognizing that the key to the learning process lies in the approximation of the data distribution, in this paper we attempt to transfer the strong approximation ability of teacher models to the decision tree, inspired by knowledge distillation, and propose the ReDT method. Experiments and comparisons demonstrate that ReDT remarkably surpasses the original decision tree and that its performance is relatively competitive with that of its teacher model. More importantly, while performing well, ReDT retains the excellent interpretability of the decision tree and even achieves a smaller model size than the decision tree. Besides, in contrast to traditional knowledge distillation, back-propagation of the student model is not necessarily required in ReDT, which can be regarded as an attempt at a new knowledge distillation approach. This new knowledge distillation method can be easily extended to other models.


References

  • [Asuncion and Newman, 2017] Arthur Asuncion and David Newman. UCI machine learning repository, 2017.
  • [Breiman and Shang, 1996] Leo Breiman and Nong Shang. Born again trees. University of California, Berkeley, Berkeley, CA, Technical Report, 1996.
  • [Breiman, 2001] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In SIGKDD, pages 785–794. ACM, 2016.
  • [Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pages 2172–2180, 2016.
  • [Demšar, 2006] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research, 7(Jan):1–30, 2006.
  • [Denil et al., 2014] Misha Denil, David Matheson, and Nando De Freitas. Narrowing the gap: Random forests in theory and in practice. In ICML, pages 665–673, 2014.
  • [Feng and Zhou, 2018] Ji Feng and Zhi-Hua Zhou. Autoencoder by forest. In AAAI, pages 2967–2973, 2018.
  • [Friedman, 2001] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
  • [Frosst and Hinton, 2017] Nicholas Frosst and Geoffrey Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
  • [Han et al., 2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • [Han et al., 2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, pages 1135–1143, 2015.
  • [Hara and Hayashi, 2018] Satoshi Hara and Kohei Hayashi. Making tree ensembles interpretable: A bayesian model selection approach. In ICAIS, pages 77–85, 2018.
  • [He et al., 2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, pages 1398–1406. IEEE, 2017.
  • [Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [Irsoy et al., 2012] Ozan Irsoy, Olcay Taner Yıldız, and Ethem Alpaydın. Soft decision trees. In ICPR, pages 1819–1822, 2012.
  • [LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [Liu et al., 2008] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In ICDM, pages 413–422, 2008.
  • [Meinshausen, 2010] Nicolai Meinshausen. Node harvest. The Annals of Applied Statistics, pages 2049–2072, 2010.
  • [Nan et al., 2016] Feng Nan, Joseph Wang, and Venkatesh Saligrama. Pruning random forests for prediction on a budget. In NeurIPS, pages 2334–2342, 2016.
  • [Painsky and Rosset, 2016] Amichai Painsky and Saharon Rosset. Compressing random forests. In ICDM, pages 1131–1136, 2016.
  • [Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
  • [Quinlan, 1993] J Ross Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
  • [Ren et al., 2015] Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Global refinement of random forest. In CVPR, pages 723–730, 2015.
  • [Romero et al., 2015] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
  • [Sabour et al., 2017] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In NeurIPS, pages 3856–3866, 2017.
  • [Scornet et al., 2015] Erwan Scornet, Gérard Biau, Jean-Philippe Vert, et al. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.
  • [Yim et al., 2017] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, pages 4133–4141, 2017.
  • [Yosinski et al., 2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NeurIPS, pages 3320–3328, 2014.
  • [Zeiler and Fergus, 2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
  • [Zhang et al., 2018] Quanshi Zhang, Wenguan Wang, and Song-Chun Zhu. Examining cnn representations with respect to dataset bias. In AAAI, pages 4464–4473, 2018.
  • [Zhou and Feng, 2017] Zhi-Hua Zhou and Ji Feng. Deep forest: Towards an alternative to deep neural networks. In IJCAI, pages 3553–3559, 2017.
  • [Zhou et al., 2018] Bolei Zhou, David Bau, Aude Oliva, and Antonio Torralba. Interpreting deep visual representations via network dissection. IEEE transactions on pattern analysis and machine intelligence, 2018.