Rectified Decision Trees: Towards Interpretability,
Compression and Empirical Soundness
Abstract
How to obtain a model with good interpretability and performance has always been an important research topic. In this paper, we propose rectified decision trees (ReDT), a knowledge-distillation-based rectification of decision trees with high interpretability, small model size, and empirical soundness. Specifically, we extend the impurity calculation and the purity-based stopping condition of the classical decision tree to obtain a decision tree extension that allows the use of soft labels generated by a well-trained teacher model in both the training and prediction processes. It is worth noting that for the acquisition of soft labels, we propose a new multiple-cross-validation-based method to reduce the effects of randomness and overfitting. These approaches ensure that ReDT retains excellent interpretability and even achieves fewer nodes than the decision tree in terms of compression, while having relatively good performance. Besides, in contrast to traditional knowledge distillation, back-propagation of the student model is not necessarily required in ReDT, which is an attempt at a new knowledge distillation approach. Extensive experiments demonstrate the superiority of ReDT in interpretability, compression, and empirical soundness.
Jiawang Bai, Yiming Li, Jiawei Li, Yong Jiang and Shutao Xia
Graduate School at Shenzhen, Tsinghua University, China
Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, China
baijw1020@gmail.com, {liym18, lijw15}@mails.tsinghua.edu.cn, {jiangy, xiast}@sz.tsinghua.edu.cn
1 Introduction
Random forests are a typical ensemble learning method, in which a large number of randomized decision trees are constructed and the results from all trees are combined for the final prediction of the forest. Since their introduction in [?], random forests and their variants [?; ?] have been widely used in many fields, such as deep learning [?; ?] and even outlier detection [?]. In addition to these applications, their theoretical properties have also been extensively studied [?; ?].
However, although complicated algorithms such as random forests and GBDT achieve great success in many areas, this high prediction performance comes at a considerable sacrifice of interpretability, caused by the essential procedures of ensemble approaches. For example, compared to decision trees, the bootstrap and voting processes of random forests make the predictions much more difficult to explain. On the contrary, decision trees are known to have the best interpretability among all machine learning algorithms, yet with relatively poor performance. Besides, forest-based algorithms and even deep neural networks (DNNs) usually require much larger storage than decision trees, which is unacceptable especially when the model is deployed on a personal device with strict storage limitations (such as a mobile phone). This conflict between empirical soundness and interpretability with flexible storage continuously drives researchers.
To address these problems, Breiman [?] proposed using a tree ensemble to generate additional samples for the further construction of a decision tree, which can be regarded as the first attempt at this problem. In [?], Node Harvest is introduced to simplify a tree ensemble by using the shallow parts of the trees. The shortcoming of Node Harvest is that the simplified model is still an ensemble, and therefore the challenge of interpretation remains. Recently, a distillation-based method was proposed, where softened labels generated by a well-trained DNN are used to create a more understandable model in the form of a soft decision tree [?]. However, since this method relies on the back-propagation of the soft decision tree, it cannot be used with classical decision trees. Besides, the interpretability of the soft decision tree [?] is much weaker than that of classical decision trees.
In this paper, we propose rectified decision trees (ReDT), a knowledge-distillation-based rectification of decision trees with high interpretability and empirical soundness that even has a smaller model size than the decision tree. The critical difference between ReDT and the decision tree lies in the use of softened labels, the weighted average of soft labels (the output probability vector of a well-trained teacher model) and hard labels, in the process of building trees. Specifically, when constructing a decision tree, the hard label is mainly involved in two parts of the construction process: (1) calculating the change of impurity and (2) determining whether the node is pure in the stopping condition. In our method, we introduce softened labels into these processes. Firstly, we calculate the average of the softened labels in the node; the proportion $p_k$ of samples with the $k$-th category needed in the calculation of the impurity criterion is redetermined using the value of the $k$-th dimension of this average. Secondly, since it is almost impossible for the mixed labels of all samples in a node to be the same, we propose to use a pseudo-category corresponding to the softened label of each sample, so that the original stopping condition can remain. In ReDT, the teacher model can be a DNN or any other classification algorithm, and therefore ReDT is universal. In contrast to traditional knowledge distillation, back-propagation of the student model is not necessarily required in ReDT, which can be regarded as an attempt at a new knowledge distillation approach. Besides, we propose a new multiple-cross-validation-based method to reduce the effects of randomness and overfitting.
The main contributions of this paper can be stated as follows: 1) we propose a decision tree extension, which is the first tree that allows training and prediction using softened labels; 2) we propose the first universal back-propagation-free distillation framework and conduct an empirical analysis of its mechanism; 3) we propose a new softened-label acquisition method based on multiple cross-validation to reduce the effects of randomness and overfitting; 4) extensive experiments demonstrate the superiority of our approach in interpretability, compression, and empirical soundness.
2 Related Work
The interpretability of complex machine learning models, especially ensemble approaches and deep learning, has received wide attention. At present, the most widely used machine learning models are mainly forest-based algorithms and DNNs, so their interpretability is of great significance. There are a few previous studies on the interpretability of forest-based algorithms. The first work is done by Breiman [?], who proposed to use a tree ensemble to generate additional samples for the further construction of a single decision tree. In [?], Node Harvest is proposed to simplify tree ensembles by using the shallow parts of the trees. Treating the simplification of tree ensembles as a model selection problem and using a Bayesian method for selection is also proposed in [?]. The interpretability research on DNNs mainly focuses on three aspects: visualizing the representations in intermediate layers of DNNs [?; ?], representation diagnosis [?; ?], and building explainable DNNs [?; ?]. Recently, a knowledge-distillation-based method was proposed, which uses a trained DNN to create a more explainable model in the form of a soft decision tree [?].
The compression of forest-based algorithms and DNNs has also received extensive attention. A series of works focuses on pruning techniques for forest-based algorithms, whose idea is to reduce the size by removing redundant components while maintaining the predictive performance [?; ?; ?]. The idea of pruning is also widely used in the compression of DNNs [?; ?]. Recently, extensive research has been conducted on compression methods based on coding or quantization [?; ?].
Recently, knowledge distillation has been widely accepted as a compression method. The concept of knowledge distillation in the teacher-student framework, introducing the teacher's softened output, is first proposed in [?]. Since then, a series of improvements and applications of knowledge distillation have been proposed [?; ?]. At present, almost all knowledge distillation methods focus on the compression of DNNs and require the back-propagation of the student model. Besides, using knowledge distillation to distill a DNN into a soft decision tree to achieve great interpretability and compressibility was recently proposed in [?]. This method can be regarded as the first attempt to apply knowledge distillation to interpretability.
3 The Proposed Method
We present the rectified decision trees (ReDT) in this section. The main concerns of our proposed method are how to define the important information of the teacher model (the distilled knowledge) and how to use it in training the student model (i.e., the ReDT). Section 3.1 introduces the distilled knowledge that is further used in the construction of ReDT. Sections 3.2 and 3.3 discuss the specific construction and prediction processes of ReDT. An empirical analysis, which demonstrates why softened labels can reach better performance than hard labels in the construction of the decision tree, is provided in Section 3.4.
3.1 Distilled Knowledge
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ represent a data set consisting of $n$ observations. Each observation has the form $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ represents the $d$-dimensional features and $y_i \in \{1, \ldots, K\}$ is the corresponding label of the observation. The label of a sample can be regarded as a single sample from a $K$-dimensional discrete distribution. Let the hard label $h_i$ denote the one-hot representation of the label $y_i$ (a $K$-dimensional vector, where the value in the dimension corresponding to the category is 1 and the rest are all 0).
From the perspective of probability, the training of a model can be considered an approximation of the distribution of the data. It is extremely difficult to recover the true conditional distribution of the labels from the hard labels directly. In contrast, the output of a well-trained model contains a significant amount of useful information compared to the original hard label itself. Inspired by this idea, we define the soft label $s_i$, which is the output probability vector of a well-trained model such as a DNN, random forests, or GBDT, as the distilled knowledge from the teacher model. This idea is also partly supported by [?], where a softened version of the final output of a teacher network is used to teach information to a small student network.
Once a well-trained teacher model is given, the generation of the soft labels is straightforward: directly output the probability vectors of all training samples. However, the teacher model itself usually needs to be obtained through training. The most straightforward idea is to train the teacher model using all training samples and output the soft labels of those samples. However, soft labels obtained through this process have relatively poor quality due to the effects of randomness and overfitting. This problem does not exist in previous knowledge distillation tasks, since there the training of teacher and student is carried out simultaneously rather than strictly one after the other, as both models can be trained through back-propagation. To address this problem, we propose a multiple-cross-validation-based method to calculate soft labels. Specifically, if 5 times 5-fold cross-validation is implemented, we first randomly divide the training set into five similarly sized sets, then use four sets of data for training and the remaining set for prediction (i.e., output its soft labels). In each run, each sample is predicted exactly once, so that each sample ends up with 5 soft labels; the final soft label is the average of all its predictions.
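The repeated cross-validation procedure above can be sketched as follows. This is a minimal illustration assuming a scikit-learn-style teacher that exposes `fit` and `predict_proba`; the function name `soft_labels_cv` and the iris data are only for demonstration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def soft_labels_cv(X, y, make_teacher, n_repeats=5, n_folds=5, seed=0):
    """Average out-of-fold predicted probabilities over repeated k-fold CV.

    In each repeat, every sample is predicted exactly once by a teacher
    trained on the other folds; the final soft label averages the repeats.
    """
    n_classes = len(np.unique(y))
    soft = np.zeros((len(y), n_classes))
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for train_idx, val_idx in skf.split(X, y):
            teacher = make_teacher().fit(X[train_idx], y[train_idx])
            soft[val_idx] += teacher.predict_proba(X[val_idx])
    return soft / n_repeats

X, y = load_iris(return_X_y=True)
soft = soft_labels_cv(X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=0))
```

Because every sample is always predicted by a model that never saw it during training, the resulting soft labels are less affected by overfitting than in-sample teacher outputs.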
3.2 ReDT Construction
In the proposed ReDT, compared to the original decision tree, there are two main alterations: the calculation of the impurity decrease and the stopping condition. We introduce softened labels into both processes.
Note that instead of using the soft label of a sample directly, we use the mixed label $\tilde{y}_i$, which is the weighted average of the soft label $s_i$ and the hard label $h_i$ with weight hyperparameter $\alpha \in [0, 1]$. That is,

$$\tilde{y}_i = \alpha \, h_i + (1 - \alpha) \, s_i. \qquad (1)$$
The hyperparameter $\alpha$ regulates the proportion of the soft label: the larger $\alpha$, the smaller the proportion of the soft label in the mixed label. When $\alpha = 1$, ReDT reduces to Breiman's decision trees. The purpose of using mixed labels is to account for the fact that the soft label may contain a certain degree of error. By adjusting $\alpha$, we can obtain softened labels with sufficient information and relative accuracy.
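As a small numerical illustration of Eq. (1) (a numpy sketch with illustrative values; the function name is ours):

```python
import numpy as np

def mixed_labels(hard_onehot, soft, alpha):
    """Weighted average of hard (one-hot) and soft labels, Eq. (1).

    alpha = 1 recovers the ordinary decision tree labels;
    alpha = 0 uses the teacher's soft labels alone.
    """
    return alpha * hard_onehot + (1.0 - alpha) * soft

hard = np.eye(3)[[0, 1, 2, 1]]            # one-hot hard labels for 4 samples
soft = np.array([[0.7, 0.2, 0.1],         # teacher probabilities (illustrative)
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6],
                 [0.3, 0.5, 0.2]])
mixed = mixed_labels(hard, soft, alpha=0.25)
```

Since both inputs are probability vectors, each mixed label again sums to one, so it can be treated as a (softened) label distribution in the tree-building steps below.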
Recall that in the classification problem, the impurity decrease caused by splitting node $\mathcal{N}$ at point $v$ is denoted by

$$\Delta I(v) = I(\mathcal{N}) - \frac{|\mathcal{N}_L|}{|\mathcal{N}|} I(\mathcal{N}_L) - \frac{|\mathcal{N}_R|}{|\mathcal{N}|} I(\mathcal{N}_R), \qquad (2)$$
where $\mathcal{N}_L$ and $\mathcal{N}_R$ are the two children sets generated by splitting at $v$, and $I(\cdot)$ is the impurity criterion (e.g., Shannon entropy or Gini index). The first alteration in ReDT concerns the probability $p_k$, which represents the proportion of samples with the $k$-th category, used in calculating the impurity decrease of a splitting point. Specifically, since each sample carries a softened label instead of a hard label, we calculate the average of the softened labels of all samples in the node, obtaining a $K$-dimensional vector; $p_k$ is redetermined as the value of the $k$-th dimension of that vector. In other words, let $\tilde{y}_i$ denote the mixed label of the $i$-th training sample. The probability $p_k$ of node $\mathcal{N}$ is calculated by
$$p_k = \frac{1}{|\mathcal{N}|} \sum_{x_i \in \mathcal{N}} \tilde{y}_i^{(k)}, \qquad (3)$$

where $|\mathcal{N}|$ denotes the number of samples in node $\mathcal{N}$.
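The softened-label impurity computation of Eqs. (2) and (3) can be sketched with the Gini index as follows (a minimal numpy sketch; the helper names are illustrative):

```python
import numpy as np

def node_probs(mixed):
    """p_k of a node: the mean of the softened labels of its samples, Eq. (3)."""
    return mixed.mean(axis=0)

def gini(mixed):
    """Gini impurity of a node computed from softened-label proportions."""
    p = node_probs(mixed)
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(mixed_parent, left_mask):
    """Impurity decrease of a split, Eq. (2), with softened labels."""
    n = len(mixed_parent)
    left, right = mixed_parent[left_mask], mixed_parent[~left_mask]
    return (gini(mixed_parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A balanced two-class node split perfectly: decrease = 0.5 - 0 - 0 = 0.5.
parent = np.eye(2)[[0, 0, 1, 1]]
decrease = impurity_decrease(parent, np.array([True, True, False, False]))
```

Because $p_k$ is a continuous average rather than a count ratio, the impurity can take many more values than with hard labels, which is exactly the point of Lemma 1 below.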
The second alteration is how to define pure in the stopping condition. In the training process of the original decision tree, if all samples in a node have a single category, the node is considered pure. At this point the stopping condition is reached, and the node is no longer split. However, in ReDT it is almost impossible for the mixed labels of all samples to be the same. Therefore, we use the category corresponding to the maximum probability in the mixed label of each sample as its pseudo-category $\hat{c}_i$, i.e.,

$$\hat{c}_i = \arg\max_k \, \tilde{y}_i^{(k)}, \qquad (4)$$

and then determine whether to continue splitting based on the original stopping condition applied to these pseudo-categories.
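A minimal sketch of the pseudo-category rule of Eq. (4) and the resulting purity check (names are illustrative):

```python
import numpy as np

def pseudo_categories(mixed):
    """Pseudo-category of each sample: the argmax of its mixed label, Eq. (4)."""
    return mixed.argmax(axis=1)

def is_pure(mixed):
    """Original stopping condition applied to pseudo-categories."""
    return len(np.unique(pseudo_categories(mixed))) == 1

node_a = np.array([[0.6, 0.3, 0.1],   # both samples lean to class 0 -> pure
                   [0.5, 0.3, 0.2]])
node_b = np.array([[0.6, 0.4],        # samples lean to different classes
                   [0.3, 0.7]])
```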
3.3 Prediction
Once the ReDT has grown based on the mixed label as described above, the predictions for a newly given sample can be made as follows.
Suppose the unlabeled sample is $x$, and let $\hat{y}$ and $\hat{p}$ denote its predicted label and predicted discrete probability distribution, respectively.

According to a series of decisions, $x$ will eventually fall into a leaf node, say $\mathcal{L}$. The predicted distribution $\hat{p}$ of $x$ is the average of the mixed labels of all training samples falling into the leaf node $\mathcal{L}$, i.e.,

$$\hat{p} = \frac{1}{|\mathcal{L}|} \sum_{x_i \in \mathcal{L}} \tilde{y}_i, \qquad (5)$$

where $|\mathcal{L}|$ denotes the number of samples in leaf node $\mathcal{L}$.

The predicted label $\hat{y}$ of $x$ is the one with the biggest probability in $\hat{p}$:

$$\hat{y} = \arg\max_k \, \hat{p}^{(k)}. \qquad (6)$$
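At the leaf level, the prediction of Eqs. (5) and (6) amounts to an average followed by an argmax (a small numpy sketch with illustrative values):

```python
import numpy as np

def leaf_prediction(leaf_mixed_labels):
    """Predicted distribution: mean of the mixed labels in the leaf, Eq. (5);
    predicted label: its argmax, Eq. (6)."""
    p_hat = leaf_mixed_labels.mean(axis=0)
    return p_hat, int(p_hat.argmax())

leaf = np.array([[0.8, 0.1, 0.1],     # mixed labels of training samples
                 [0.6, 0.3, 0.1],     # that fell into this leaf
                 [0.7, 0.2, 0.1]])
p_hat, y_hat = leaf_prediction(leaf)
```

Note that, unlike an ordinary decision tree, the leaf distribution reflects the teacher's soft information, not just the majority class counts.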
3.4 Empirical Analysis
The reason why softened labels rather than hard labels should be used can be further demonstrated from the perspectives of impurity calculation and distribution approximation. The specific analyses are as follows:
Lemma 1 (Integer Partition Lemma).
Suppose there is an integer $n$ which is the sum of $K$ non-negative integers $n_1, \ldots, n_K$, i.e., $n = n_1 + n_2 + \cdots + n_K$.
There are in total $\binom{n+K-1}{K-1}$ possible values for the ordered tuple $(n_1, \ldots, n_K)$.
Proof.
This problem is equivalent to picking $K-1$ divider locations from $n+K-1$ locations (the stars-and-bars argument). The result is then immediate. ∎
Lemma 1 indicates that for a $K$-class classification problem, if a node contains $n$ samples, then the impurity of this node has at most $\binom{n+K-1}{K-1}$ possible values. In other words, compared to softened labels, the use of hard labels limits the precision of the impurity of the nodes. This limitation has a strongly adverse effect on the selection of the split point, especially when the number of samples is relatively small.
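The count in Lemma 1 can be checked by brute force against the stars-and-bars formula (a small verification script, not part of the method itself):

```python
from itertools import product
from math import comb

def count_hard_label_partitions(n, K):
    """Count ordered tuples (n_1, ..., n_K) of non-negative integers summing
    to n, i.e. the possible class-count vectors (and hence distinct impurity
    inputs) of a node with n hard-labeled samples and K classes."""
    return sum(1 for t in product(range(n + 1), repeat=K) if sum(t) == n)

# Stars and bars: the count equals C(n + K - 1, K - 1).
for n, K in [(5, 2), (6, 3), (4, 4)]:
    assert count_hard_label_partitions(n, K) == comb(n + K - 1, K - 1)
```

With softened labels, by contrast, the node proportions $p_k$ vary continuously, so the impurity is not restricted to this finite grid.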
From another perspective, the improvement brought by soft labels arises because it is difficult to recover the underlying distribution directly from hard labels, especially when the number of samples is relatively small. However, once a well-trained teacher model provides relatively correct softened labels, a large amount of information about the distribution is contained in them. The use of this information shifts the decision surface toward its true position compared to using the hard labels.
3.5 Comparison between DT, SDT and ReDT
Although both soft decision trees (SDT) and ReDT are extensions of DT, there are many differences between them. In this section, we compare DT, SDT and ReDT from five aspects: (1) interpretability, (2) empirical soundness, (3) back-propagation needed, (4) softened labels allowed and (5) compression, as shown in Table 1. A method that satisfies an aspect is marked by ✓.
It is worth noting that interpretability, empirical soundness, and compression are relative. For example, the interpretability of SDT is stronger than that of a DNN but much weaker than that of DT and ReDT. Besides, since back-propagation of the student model is not necessarily required in ReDT, this new knowledge distillation approach can be easily extended to other models while preserving running efficiency.
Table 1: Comparison among DT, SDT, and ReDT.

                                 DT    SDT   ReDT
Interpretability                 ✓           ✓
Empirical soundness                    ✓     ✓
Back-propagation needed                ✓
Softened labels allowed                ✓     ✓
Compression (small model size)   ✓           ✓
4 Experiments
4.1 Configuration
For the DNN configurations, the experiments were conducted on the benchmark dataset MNIST [?]. All networks were trained using Adam with an initial learning rate of 0.1. The learning rate was divided by 10 after epochs 20 and 40 (50 epochs in total). We examine a variety of DNN architectures, including MLP, LeNet-5, and VGG-11, using ReLU as the activation function and cross-entropy as the loss function. The MLP has two hidden layers, with 784 and 256 units respectively, and a dropout rate of 0.5 for the hidden layers. Besides, the temperature used when generating soft labels from the DNN is set to 4, as suggested in [?].
Data set  Class  Features  Instances 

adult  2  14  48842 
crx  2  15  690 
EEG  2  15  14980 
bank  2  17  45211 
german  2  20  1000 
cmc  3  9  1473 
connect4  3  42  67557 
landcover  9  147  675 
letter  26  15  20000 
isolet  26  617  7797 
All datasets involved in the evaluation of forest-based teachers are obtained from the UCI repository [?]. Their information is listed in Table 2. Part of the data is used for training and the rest for testing. Here we use random forests (RF) and GBDT as the teacher models; they are representatives of the bagging and boosting families of forest-based methods, respectively. We determine the value of $\alpha$ by grid search with a step of 0.1, and the implementation of GBDT is based on the scikit-learn platform [?]. The number of trees in both random forests and GBDT is set to 100. Besides, the performance of the decision tree trained with hard labels is provided as a benchmark. Since the soft decision tree is more like a tree-shaped neural network with much weaker interpretability than the classical decision tree, it is not compared as a benchmark in the experiments. The Gini index is used in RF, DT, and ReDT as the impurity measure, and the minimum leaf size is set for RF, GBDT, DT, and ReDT as suggested in [?].
Besides, 5 times 5-fold cross-validation is used to calculate the soft labels of the training set, and the Wilcoxon signed-rank test [?] is carried out to test for differences between the results of ReDT and those of decision trees at a significance level of 0.05. Compared with decision trees, ReDT results with better performance (higher accuracy or fewer nodes) are indicated in boldface; those with a statistically significant difference from the decision tree are marked. We repeat each experiment 10 times to reduce the effect of randomness.
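The significance test can be reproduced with SciPy; the per-run accuracy values below are hypothetical and only illustrate the procedure:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-run test accuracies of ReDT vs. DT over 10 repetitions
# (illustrative numbers, not taken from the paper's tables).
redt = np.array([0.861, 0.860, 0.863, 0.859, 0.862,
                 0.860, 0.861, 0.864, 0.858, 0.862])
dt   = np.array([0.818, 0.820, 0.817, 0.819, 0.821,
                 0.816, 0.818, 0.820, 0.819, 0.817])

# Paired Wilcoxon signed-rank test on the per-run differences.
stat, p_value = wilcoxon(redt, dt)
significant = p_value < 0.05
```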
Dataset  RF  GBDT  DT  ReDT(RF)  ReDT(GBDT)  α(RF)  α(GBDT) 

adult  86.54%  86.53%  81.86%  86.18%  86.16%  0.01  0.06 
crx  86.14%  86.09%  80.51%  85.46%  84.40%  0.08  0.11 
EEG  81.50%  90.58%  82.88%  83.02%  83.01%  0.24  0.52 
bank  90.38%  90.41%  87.60%  90.11%  90.15%  0.06  0.03 
german  76.60%  76.13%  68.37%  73.40%  72.67%  0.07  0.10 
cmc  55.15%  55.66%  48.31%  55.05%  55.41%  0  0 
connect4  75.38%  77.58%  71.73%  76.69%  76.02%  0.30  0.30 
landcover  83.69%  83.80%  76.55%  77.59%  77.14%  0.54  0.37 
letter  91.56%  93.61%  85.65%  86.01%  86.15%  0.9  0.9 
isolet  93.68%  93.32%  79.83%  81.40%  81.77%  0.57  0.33 

: ReDT is better than decision trees at a level of significance 0.05.

: The average of the best $\alpha$ for each experiment.
Dataset  RF  GBDT  DT  ReDT(RF)  ReDT(GBDT) 

adult  244832  1486  7869  2286  2023 
crx  6191  1336  103  48  65 
EEG  125289  1489  1858  1948  1939 
bank  223906  1470  4302  1678  1603 
german  11063  1404  227  140  172 
cmc  16220  4206  630  202  275 
connect4  470261  4426  18152  8813  8740 
landcover  6100  9492  85  43  49 
letter  168101  38631  2752  2464  2459 
isolet  58905  32751  707  464  593 

: ReDT is better than decision trees at a level of significance 0.05.
4.2 DNN Teachers
We discuss the performance of ReDT, including test accuracy (ACC) and the number of nodes (NODE), under different teacher models, and compare it with DT and its teacher models in this section.
MLP  LeNet5  VGG11  
ACC (DNN)  98.33%  99.42%  99.49% 
ACC (DT)  87.55%  
NODE (DT)  5957  
ACC (ReDT)  88.21%  88.57%  88.53% 
NODE (ReDT)  5361  5173  5803 

: ReDT is better than DT at a level of significance 0.05.
As shown in Table 5, although there is still a gap in ACC between ReDT and its teacher models, since the decision tree cannot learn the spatial relationships among the raw pixels, ReDT achieves a remarkable improvement over the original decision tree. Not to mention that, in terms of compression, ReDT even has fewer nodes than DT (and therefore a smaller model size).
4.3 Forest-based Teachers
Tables 3 and 4 show the test accuracy of the different forest-based teacher models and the number of nodes, respectively. Regardless of which teacher model is used, ReDT achieves a remarkable improvement in both accuracy and compression. On all ten data sets, ReDT has higher test accuracy than the decision tree, and this improvement is significant on seven of those data sets. Specifically, ReDT achieves an increase of almost 5% accuracy over DT on half of the data sets. In particular, on three data sets (bank, adult, and connect4), ReDT performs similarly to its teacher model. Also, the optimal value of $\alpha$ seems to have some direct connection with the number of categories in the dataset. Specifically, data sets with more categories (such as landcover, isolet, and letter) generally have a larger optimal $\alpha$; in other words, a large proportion of hard labels needs to be included in the mixed label to give ReDT excellent performance. Two reasons may cause this: 1) the more categories, the more likely the soft labels are to contain erroneous information; 2) the more categories, the higher the interference caused by the erroneous information contained in the soft labels. Regardless of the reason, the number of categories can be used to provide an initial intuition for $\alpha$. In terms of compression, on nine of the ten data sets, the number of nodes in ReDT is smaller than in the decision tree. In other words, ReDT has a smaller model size than the decision tree, not to mention the teacher models, such as random forests and GBDT, which are usually far more complicated. Overall, ReDT with a forest-based teacher achieves a significant improvement in both performance and compression.
4.4 Discussion
In this section, we discuss the compression, the interpretability, and the impact of the hyperparameters on the model.
Compression
As mentioned above, ReDT is an extension of the decision tree, and therefore the size of its model can still be measured by the number of nodes. Without loss of generality, we compare ReDT with the decision tree here. There are two advantages to such a comparison: 1) the decision tree requires almost the fewest nodes among forest-based algorithms, not to mention that its size is much smaller than that of a DNN or other complex algorithms; if a model has a relatively smaller size than the decision tree, then it must have excellent compression; 2) the sizes of the decision tree and the ReDT model are both reflected by the number of nodes, which is convenient for comparison.
Without loss of generality, we use random forests and GBDT as the teacher models here. The compression rate on multiple data sets under different values of the hyperparameter $\alpha$ is shown in Fig. 2. The smaller the compression rate, the smaller the model size of ReDT.
It can be seen that the compression rate on almost every dataset is less than 1, which indicates that, for all values of the hyperparameter $\alpha$, ReDT has a smaller model size than DT in most cases. In addition, as $\alpha$ increases, the compression rate shows an upward trend. This is caused by the fact that the soft label carries a large amount of information about the distribution, whether correct or not, thus facilitating the decision tree's division of the data. The smaller the $\alpha$, the more significant the proportion of the soft label in the mixed label, and therefore the smaller the size of the model. Thus, although there is no single $\alpha$ that corresponds to the highest test accuracy on all datasets (because this is closely related to complex factors such as the correctness of the soft labels and the dataset itself), using $\alpha$ to adjust the size of the model is a good choice. Besides, as shown in the figure, the growth of the compression rate on data sets with more categories is significantly slower. Regardless of the reason, this opposite tendency between compression rate and accuracy (the more categories, the larger the optimal $\alpha$) allows ReDT to have a smaller model size while achieving empirical soundness.
Interpretability
The decision tree makes its prediction depending on the leaf node to which the input belongs; the corresponding leaf node is determined by traversing the tree from the root. Although the path can represent the decision of the model, when the sample is high-dimensional, especially when it is an image or speech, a single category will have a large number of different decision paths, and therefore it is difficult to explain the output by simply listing the path. To address this problem, we propose to highlight the key features (pixels) used in the sample's decision path.
Here, we use MNIST as an example to demonstrate the strong interpretability of ReDT. We randomly select three samples of each digit to predict. The pixels contained in each decision path, i.e., the key pixels, are marked in red, as shown in Fig. 1. Although we do not have a ground-truth decision path, since the key pixels almost trace the outline of the digit, the prediction is highly interpretable and trustworthy.
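Extracting the key features along a decision path is straightforward with scikit-learn's tree API. The sketch below uses an ordinary DecisionTreeClassifier on the digits dataset as a stand-in for ReDT on MNIST; the helper name `key_features` is ours:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

def key_features(tree, x):
    """Indices of the features (pixels) tested along x's decision path."""
    node_ids = tree.decision_path(x.reshape(1, -1)).indices
    feats = tree.tree_.feature[node_ids]
    # Leaf nodes carry a negative feature marker; keep only real split features.
    return sorted(set(int(f) for f in feats if f >= 0))

pixels = key_features(tree, X[0])   # key pixels for the first digit image
```

Highlighting exactly these pixel indices on the input image reproduces the kind of visualization shown in Fig. 1.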
5 Conclusion
By recognizing that the key of the learning process lies in the approximation of the data distribution, in this paper we attempt to endow the decision tree with the great approximation ability of other teacher models, inspired by knowledge distillation, and propose the ReDT method. Experiments and comparisons demonstrate that ReDT remarkably surpasses the original decision tree, and its performance is relatively competitive with that of its teacher model. More importantly, while having good performance, ReDT retains the excellent interpretability of the decision tree and even achieves a smaller model size than the decision tree. Besides, in contrast to traditional knowledge distillation, back-propagation of the student model is not necessarily required in ReDT, which can be regarded as an attempt at a new knowledge distillation approach. This new knowledge distillation method can be easily extended to other models.
References
 [Asuncion and Newman, 2017] Arthur Asuncion and David Newman. UCI machine learning repository, 2017.
 [Breiman and Shang, 1996] Leo Breiman and Nong Shang. Born again trees. University of California, Berkeley, Berkeley, CA, Technical Report, 1996.
 [Breiman, 2001] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 [Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In SIGKDD, pages 785–794. ACM, 2016.
 [Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pages 2172–2180, 2016.
 [Demšar, 2006] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research, 7(Jan):1–30, 2006.
 [Denil et al., 2014] Misha Denil, David Matheson, and Nando De Freitas. Narrowing the gap: Random forests in theory and in practice. In ICML, pages 665–673, 2014.
 [Feng and Zhou, 2018] Ji Feng and Zhi-Hua Zhou. Autoencoder by forest. In AAAI, pages 2967–2973, 2018.
 [Friedman, 2001] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
 [Frosst and Hinton, 2017] Nicholas Frosst and Geoffrey Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
 [Han et al., 2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [Han et al., 2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, pages 1135–1143, 2015.
 [Hara and Hayashi, 2018] Satoshi Hara and Kohei Hayashi. Making tree ensembles interpretable: A bayesian model selection approach. In ICAIS, pages 77–85, 2018.
 [He et al., 2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, pages 1398–1406. IEEE, 2017.
 [Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [Irsoy et al., 2012] Ozan Irsoy, Olcay Taner Yıldız, and Ethem Alpaydın. Soft decision trees. In ICPR, pages 1819–1822, 2012.
 [LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [Liu et al., 2008] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In ICDM, pages 413–422, 2008.
 [Meinshausen, 2010] Nicolai Meinshausen. Node harvest. The Annals of Applied Statistics, pages 2049–2072, 2010.
 [Nan et al., 2016] Feng Nan, Joseph Wang, and Venkatesh Saligrama. Pruning random forests for prediction on a budget. In NeurIPS, pages 2334–2342, 2016.
 [Painsky and Rosset, 2016] Amichai Painsky and Saharon Rosset. Compressing random forests. In ICDM, pages 1131–1136, 2016.
 [Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikitlearn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
 [Quinlan, 1993] J Ross Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
 [Ren et al., 2015] Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Global refinement of random forest. In CVPR, pages 723–730, 2015.
 [Romero et al., 2015] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
 [Sabour et al., 2017] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In NeurIPS, pages 3856–3866, 2017.
 [Scornet et al., 2015] Erwan Scornet, Gérard Biau, Jean-Philippe Vert, et al. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.
 [Yim et al., 2017] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, pages 4133–4141, 2017.
 [Yosinski et al., 2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NeurIPS, pages 3320–3328, 2014.
 [Zeiler and Fergus, 2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
 [Zhang et al., 2018] Quanshi Zhang, Wenguan Wang, and SongChun Zhu. Examining cnn representations with respect to dataset bias. In AAAI, pages 4464–4473, 2018.
 [Zhou and Feng, 2017] Zhi-Hua Zhou and Ji Feng. Deep forest: Towards an alternative to deep neural networks. In IJCAI, pages 3553–3559, 2017.
 [Zhou et al., 2018] Bolei Zhou, David Bau, Aude Oliva, and Antonio Torralba. Interpreting deep visual representations via network dissection. IEEE transactions on pattern analysis and machine intelligence, 2018.